
Re: Best Practice for designing XML vocabularies containingacc

  • From: Michael Kay <mike@saxonica.com>
  • To: xml-dev@lists.xml.org
  • Date: Sat, 02 Feb 2013 22:46:52 +0000

Roger, stop reinventing the wheel. This is all known territory you are 
exploring. Read

http://www.w3.org/TR/charmod-norm/

and if you think it's wrong, tell us why.

Michael Kay
Saxonica


On 02/02/2013 19:03, Costello, Roger L. wrote:
> Hi Folks,
>
> I propose the following as Best Practice:
>
> 	For elements and attributes that have accents,
> 	allow users to express them in either composed
> 	normalized form (NFC) or decomposed normalized
> 	form (NFD).
>
> Example: suppose that your XML vocabulary is to contain this element:
>
> 	<résumé>
>
> Notice the two accented characters.
>
> There are two standard, canonical ways to express those accented characters:
>
> 1. Normalization Form Composed (NFC): the accented character is expressed as a single composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
>
> 2. Normalization Form Decomposed (NFD): the accented character is expressed as a decomposed sequence of two characters (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
>
> In the following XML document the first <résumé> element is expressed using NFC. The second is expressed using NFD:
>
> 	<?xml version="1.0" encoding="UTF-8"?>
> 	<Test>
> 	        <résumé>____</résumé>
> 	        <résumé>____</résumé>
> 	</Test>
>
> The two <résumé> elements appear the same, don’t they? That’s a neat thing about NFC and NFD -- visualization tools display them the same way.
>
> In order for users to express accented elements and attributes in either NFC or NFD, design your XML Schemas using an <xs:choice> element. In the following XSD snippet the first résumé is NFC and the second is NFD:
>
>              <xs:choice>
>                  <xs:element name="résumé" type="xs:string" />
>                  <xs:element name="résumé" type="xs:string" />
>              </xs:choice>
>
> By designing your schemas in this fashion you empower your instance document authors to use whatever normalization form they prefer (or their tools prefer).
>
> I inquired on the Unicode mailing list about NFD. Here are my notes on their responses:
>
> Most text exchanged on the Internet is NFC-encoded. However, you can't count on text to always be NFC-encoded. In fact, there are definite advantages to NFD-encoding text.
>
> Some operating systems store filenames in NFD encoding.
>
> It’s easier to remember a handful of useful composing accents than the much larger number of combined forms.
>
> NFD makes the regular expressions used to qualify its contents much, *much* simpler.  I imagine that things like fuzzy text matching are easier in NFD.
>
> There are well-documented cases of, for example, keyboards that generate de-normalized sequences, file systems that use other forms, and tools which generate content that is not normalized. This content enters the Web in a non-NFC state.
>
> It is easier to use a few keystrokes for combining accents than to set up compose key sequences for all the possible composed characters.
>
> It’s easier to do searches and other text processing on NFD-encoded text.
>
> Some Unicode-defined processes, such as capitalization, are not guaranteed to preserve normalization forms. So the result of converting a lowercase character in NFC may be a decomposed uppercase character sequence (i.e., NFD).
>
> Thoughts?
>
> /Roger
>
>      I have approximate answers and possible beliefs
>      in different degrees of certainty about different
>      things, but I'm not absolutely sure of anything.
>
>                                               Richard Feynman
>
> _______________________________________________________________________
>
> XML-DEV is a publicly archived, unmoderated list hosted by OASIS
> to support XML implementation and development. To minimize
> spam in the archives, you must subscribe before posting.
>
> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
> Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
> subscribe: xml-dev-subscribe@lists.xml.org
> List archive: http://lists.xml.org/archives/xml-dev/
> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
>
>
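
For readers who want to see the difference between the two forms discussed in the quoted message, here is a minimal sketch in Python using the standard unicodedata module. The variable names and the same_name helper are illustrative assumptions, not part of either post. It shows that the NFC and NFD spellings of "résumé" are different code point sequences, and that they compare equal only once both are normalized to the same form.

```python
import unicodedata

# Two spellings of the same visible name (hypothetical variable names).
nfc_name = "r\u00e9sum\u00e9"      # precomposed: U+00E9
nfd_name = "re\u0301sume\u0301"    # decomposed: U+0065 followed by U+0301


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(nfc_name), len(nfd_name))    # 6 8
print(nfc_name == nfd_name)            # False: raw code points differ
print(same_name(nfc_name, nfd_name))   # True: equal once normalized
```

A plain code-point comparison treats the two spellings as different strings, which is why the quoted schema has to declare the element twice to accept both.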


