
Re: Best Practice for designing XML vocabularies containing ac

  • From: Jim Melton <jim.melton@oracle.com>
  • To: "Costello, Roger L." <costello@mitre.org>
  • Date: Sat, 02 Feb 2013 14:03:50 -0700

Roger,

This is, IMHO, a Really Bad Idea.  It would be 
far better to automatically (e.g., via a script) 
normalize all input documents before validating or otherwise processing them.
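A minimal sketch of such a normalization step, in Python. The helper name normalize_xml_text and the choice of NFC as the target form are assumptions made for illustration, not something specified in this thread; the only library call used is unicodedata.normalize from the standard library.

```python
import unicodedata


def normalize_xml_text(text: str, form: str = "NFC") -> str:
    """Return text converted to one Unicode normalization form.

    Hypothetical helper: run it over the whole document text before
    parsing or validating, so element and attribute names compare equal
    no matter which form the author's tools produced.
    """
    return unicodedata.normalize(form, text)


# The same element written with precomposed characters (NFC) ...
nfc_doc = "<r\u00e9sum\u00e9>____</r\u00e9sum\u00e9>"
# ... and with base letters followed by U+0301 COMBINING ACUTE ACCENT (NFD).
nfd_doc = "<re\u0301sume\u0301>____</re\u0301sume\u0301>"

# The two strings render alike but are different code point sequences.
assert nfc_doc != nfd_doc

# After normalization both inputs are identical, so the schema needs
# only one declaration of the element.
assert normalize_xml_text(nfd_doc) == nfc_doc
assert normalize_xml_text(nfc_doc) == nfc_doc
```

With a step like this ahead of validation, the schema can declare the element once, in a single form, instead of listing every variant.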

Your proposal addresses only a tiny fraction of 
the possible character-based "gotchas" and 
probably not the most important fraction, either.

Hope this helps,
    Jim


At 2/2/2013 12:03 PM, Costello, Roger L. wrote:
>Hi Folks,
>
>I propose the following as Best Practice:
>
>     For elements and attributes that have accents,
>     allow users to express them in either composed
>     normalized form (NFC) or decomposed normalized
>     form (NFD).
>
>Example: suppose that your XML vocabulary is to contain this element:
>
>     <résumé>
>
>Notice the two accented characters.
>
>There are two standard, canonical ways to express those accented characters:
>
>1. Normalization Form Composed (NFC): the 
>accented character is expressed as a single 
>composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
>
>2. Normalization Form Decomposed (NFD): the 
>accented character is expressed as a decomposed 
>sequence of two characters (U+0065 LATIN SMALL 
>LETTER E, U+0301 COMBINING ACUTE ACCENT)
>
>In the following XML document the first <résumé> 
>element is expressed using NFC. The second is expressed using NFD:
>
>     <?xml version="1.0" encoding="UTF-8"?>
>     <Test>
>             <résumé>____</résumé>
>             <résumé>____</résumé>
>     </Test>
>
>The two <résumé> elements appear the same, don’t 
>they? That’s a neat thing about NFC and NFD -- 
>visualization tools display them the same way.
>
>In order for users to express accented elements 
>and attributes in either NFC or NFD, design your 
>XML Schemas using a <xs:choice> element. In the 
>following XSD snippet the first résumé is NFC and the second is NFD:
>
>             <xs:choice>
>                 <xs:element name="résumé" type="xs:string" />
>                 <xs:element name="résumé" type="xs:string" />
>             </xs:choice>
>
>By designing your schemas in this fashion you 
>empower your instance document authors to use 
>whatever normalization form they prefer (or their tools prefer).
>
>I inquired on the Unicode mailing list about 
>NFD. Here are my notes on their responses:
>
>Most text exchanged on the Internet is 
>NFC-encoded. However, you can't count on text to 
>always be NFC-encoded. In fact, there are 
>definite advantages to NFD-encoding text.
>
>Some operating systems store filenames in NFD encoding.
>
>It’s easier to remember a handful of useful 
>composing accents than the much larger number of combined forms.
>
>NFD makes the regular expressions used to 
>qualify its contents much, *much* simpler.  I 
>imagine that things like fuzzy text matching are easier in NFD.
>
>There are well-documented cases of, for example, 
>keyboards that generate de-normalized sequences, 
>file systems that use other forms, and tools 
>which generate content that is not normalized. 
>This content enters the Web in a non-NFC state.
>
>It is easier to use a few keystrokes for 
>combining accents than to set up compose key 
>sequences for all the possible composed characters.
>
>It’s easier to do searches and other text processing on NFD-encoded text.
>
>Some Unicode-defined processes, such as 
>capitalization, are not guaranteed to preserve 
>normalization forms. So the result of converting 
>a lowercase character in NFC may be a decomposed 
>uppercase character sequence (i.e., NFD).
>
>Thoughts?
>
>/Roger
>
>     I have approximate answers and possible beliefs
>     in different degrees of certainty about different
>     things, but I'm not absolutely sure of anything.
>
>                                              Richard Feynman
>

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
   Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG    Fax : +1.801.942.3345
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Alternate email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA  Personal email: SheltieJim at xmission dot com
========================================================================
=  Facts are facts.   But any opinions expressed are the opinions      =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================  


