Re: Best Practice for designing XML vocabularies containing ac
Roger,

This is, IMHO, a Really Bad Idea. It would be far better to automatically (e.g., via a script) normalize all input documents before validating or otherwise processing them. Your proposal addresses only a tiny fraction of the possible character-based "gotchas" and probably not the most important fraction, either.

Hope this helps,
   Jim

At 2/2/2013 12:03 PM, Costello, Roger L. wrote:
>Hi Folks,
>
>I propose the following as Best Practice:
>
>     For elements and attributes that have accents,
>     allow users to express them in either composed
>     normalized form (NFC) or decomposed normalized
>     form (NFD).
>
>Example: suppose that your XML vocabulary is to contain this element:
>
>     <résumé>
>
>Notice the two accented characters.
>
>There are two standard, canonical ways to express those accented characters:
>
>1. Normalization Form Composed (NFC): the
>accented character is expressed as a single
>composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
>
>2. Normalization Form Decomposed (NFD): the
>accented character is expressed as a decomposed
>sequence of two characters (U+0065 LATIN SMALL
>LETTER E, U+0301 COMBINING ACUTE ACCENT)
>
>In the following XML document the first <résumé>
>element is expressed using NFC. The second is expressed using NFD:
>
>     <?xml version="1.0" encoding="UTF-8"?>
>     <Test>
>         <résumé>____</résumé>
>         <résumé>____</résumé>
>     </Test>
>
>The two <résumé> elements appear the same, don't
>they? That's a neat thing about NFC and NFD --
>visualization tools display them the same way.
>
>In order for users to express accented elements
>and attributes in either NFC or NFD, design your
>XML Schemas using a <xs:choice> element.
>In the following XSD snippet the first résumé is NFC and the second is NFD:
>
>     <xs:choice>
>         <xs:element name="résumé" type="xs:string" />
>         <xs:element name="résumé" type="xs:string" />
>     </xs:choice>
>
>By designing your schemas in this fashion you
>empower your instance document authors to use
>whatever normalization form they prefer (or their tools prefer).
>
>I inquired on the Unicode mailing list about
>NFD. Here are my notes on their responses:
>
>Most text exchanged on the Internet is
>NFC-encoded. However, you can't count on text to
>always be NFC-encoded. In fact, there are
>definite advantages to NFD-encoding text.
>
>Some operating systems store filenames in NFD encoding.
>
>It's easier to remember a handful of useful
>composing accents than the much larger number of combined forms.
>
>NFD makes the regular expressions used to
>qualify its contents much, *much* simpler. I
>imagine that things like fuzzy text matching are easier in NFD.
>
>There are well-documented cases of, for example,
>keyboards that generate de-normalized sequences,
>file systems that use other forms, and tools
>which generate content that is not normalized.
>This content enters the Web in a non-NFC state.
>
>It is easier to use a few keystrokes for
>combining accents than to set up compose key
>sequences for all the possible composed characters.
>
>It's easier to do searches and other text processing on NFD-encoded text.
>
>Some Unicode-defined processes, such as
>capitalization, are not guaranteed to preserve
>normalization forms. So the result of converting
>a lowercase character in NFC may be a decomposed
>uppercase character sequence (i.e., NFD).
>
>Thoughts?
>
>/Roger
>
>     I have approximate answers and possible beliefs
>     in different degrees of certainty about different
>     things, but I'm not absolutely sure of anything.
>
>                                        Richard Feynman
>
>_______________________________________________________________________
>
>XML-DEV is a publicly archived, unmoderated list hosted by OASIS
>to support XML implementation and development. To minimize
>spam in the archives, you must subscribe before posting.
>
>[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
>Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
>subscribe: xml-dev-subscribe@lists.xml.org
>List archive: http://lists.xml.org/archives/xml-dev/
>List Guidelines: http://www.oasis-open.org/maillists/guidelines.php

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
  Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG   Fax : +1.801.942.3345
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Alternate email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA  Personal email: SheltieJim at xmission dot com
========================================================================
=  Facts are facts.   But any opinions expressed are the opinions      =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================
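
For readers who want to see what "normalize all input documents before validating" might look like in practice, here is a minimal sketch in Python. It is not taken from the thread: the helper names `normalize_xml_text` and `element_names` are hypothetical, the sample document is a cut-down version of the quoted one, and the choice of NFC as the target form is an assumption. It relies only on the standard library (`unicodedata.normalize` and `xml.etree.ElementTree`).

```python
import unicodedata
import xml.etree.ElementTree as ET

# The same element name in both forms, written with escapes so that the
# difference is visible in the source (these constants are illustrative).
NFC_NAME = "r\u00e9sum\u00e9"      # U+00E9: e-acute as a single code point
NFD_NAME = "re\u0301sume\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def normalize_xml_text(text, form="NFC"):
    """Return the document text converted to one normalization form."""
    return unicodedata.normalize(form, text)


def element_names(xml_text):
    """Parse the document and list the tag names of the root's children."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root]


# A document that mixes both spellings, as in the quoted example.
mixed = f"<Test><{NFC_NAME}>a</{NFC_NAME}><{NFD_NAME}>b</{NFD_NAME}></Test>"

# The two names render identically but are different strings.
names_differ_before = NFC_NAME != NFD_NAME

# After normalizing the whole document only one spelling remains, so a
# schema would need to declare the element once rather than twice.
names_after = element_names(normalize_xml_text(mixed))
```

With this kind of pre-processing step the schema does not need an <xs:choice> per accented name. One caveat worth keeping in mind: normalizing raw document text also rewrites element content, and a combining mark that directly follows a markup character can compose with it (for example, ">" followed by U+0338 becomes U+226F under NFC), so a production pipeline would need to decide how to handle such input.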