XML-DEV Mailing List Archive

Subject: Best Practice for designing XML vocabularies containing accented characters
Hi Folks,

I propose the following as Best Practice:

    For elements and attributes that have accents, allow users to express
    them in either composed normalized form (NFC) or decomposed normalized
    form (NFD).

Example: suppose that your XML vocabulary is to contain this element:

    <résumé>

Notice the two accented characters. There are two standard, canonical ways to express those accented characters:

1. Normalization Form Composed (NFC): the accented character is expressed as a single composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE).

2. Normalization Form Decomposed (NFD): the accented character is expressed as a decomposed sequence of two characters (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT).

In the following XML document the first <résumé> element is expressed using NFC. The second is expressed using NFD:

    <?xml version="1.0" encoding="UTF-8"?>
    <Test>
        <résumé>____</résumé>
        <résumé>____</résumé>
    </Test>

The two <résumé> elements appear the same, don't they? That's a neat thing about NFC and NFD -- visualization tools display them the same way.

In order for users to express accented elements and attributes in either NFC or NFD, design your XML Schemas using an <xs:choice> element. In the following XSD snippet the first résumé is NFC and the second is NFD:

    <xs:choice>
        <xs:element name="résumé" type="xs:string" />
        <xs:element name="résumé" type="xs:string" />
    </xs:choice>

By designing your schemas in this fashion you empower your instance document authors to use whichever normalization form they prefer (or their tools prefer).

I inquired on the Unicode mailing list about NFD. Here are my notes on their responses:

- Most text exchanged on the Internet is NFC-encoded. However, you can't count on text always being NFC-encoded. In fact, there are definite advantages to NFD-encoding text.

- Some operating systems store filenames in NFD encoding.

- It's easier to remember a handful of useful combining accents than the much larger number of composed forms.
- NFD makes the regular expressions used to qualify its contents much, *much* simpler.

- I imagine that things like fuzzy text matching are easier in NFD.

- There are well-documented cases of, for example, keyboards that generate de-normalized sequences, file systems that use other forms, and tools that generate content that is not normalized. This content enters the Web in a non-NFC state.

- It is easier to use a few keystrokes for combining accents than to set up compose-key sequences for all the possible composed characters.

- It's easier to do searches and other text processing on NFD-encoded text.

- Some Unicode-defined processes, such as capitalization, are not guaranteed to preserve normalization forms. So the result of uppercasing a lowercase character in NFC may be a decomposed uppercase character sequence (i.e., NFD).

Thoughts?

/Roger

"I have approximate answers and possible beliefs in different degrees of certainty about different things, but I'm not absolutely sure of anything." -- Richard Feynman
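The points above are easy to verify with a short Python sketch using only the standard library (the variable names and the tiny test document are just for illustration, not anything from the original post). It shows that the NFC and NFD spellings of "résumé" display identically but are different code point sequences, that unicodedata.normalize converts between them, and that an XML parser consequently treats them as two different element names -- which is the motivation for the <xs:choice> design above.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The same word in the two canonical normalization forms.
nfc = "r\u00e9sum\u00e9"        # NFC: "é" is the single character U+00E9
nfd = "re\u0301sume\u0301"      # NFD: "é" is U+0065 followed by U+0301

print(nfc, nfd)                 # both display as "résumé"
print(nfc == nfd)               # False -- different code point sequences
print(len(nfc), len(nfd))       # 6 vs. 8 code points

# unicodedata.normalize converts between the two forms.
print(unicodedata.normalize("NFC", nfd) == nfc)   # True
print(unicodedata.normalize("NFD", nfc) == nfd)   # True

# An XML parser compares names code point by code point, so an NFC
# <résumé> and an NFD <résumé> are two *different* element names.
doc = "<Test><r\u00e9sum\u00e9/><re\u0301sume\u0301/></Test>"
root = ET.fromstring(doc)
print(root[0].tag == root[1].tag)   # False
```

Note that combining characters such as U+0301 are legal XML name characters, which is why the NFD spelling parses without error even though it is a distinct name.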