
Best Practice for designing XML vocabularies containing accented characters

  • From: "Costello, Roger L." <costello@mitre.org>
  • To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
  • Date: Sat, 2 Feb 2013 19:03:49 +0000

Hi Folks,

I propose the following as Best Practice:

	For element and attribute names that contain
	accented characters, allow users to express them
	in either composed normalized form (NFC) or
	decomposed normalized form (NFD).

Example: suppose that your XML vocabulary is to contain this element:

	<résumé>

Notice the two accented characters. 

There are two standard, canonical ways to express those accented characters:

1. Normalization Form Composed (NFC): the accented character is expressed as a single composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE)

2. Normalization Form Decomposed (NFD): the accented character is expressed as a decomposed sequence of two characters (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
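
To make the difference between the two forms concrete, here is a minimal Python sketch using the standard unicodedata module. The helper name code_points is just for illustration; the name is written with escapes so there is no doubt which form the source text is in.

```python
import unicodedata

# "résumé" written with escapes so the starting form is unambiguous (NFC here).
name = "r\u00e9sum\u00e9"

nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)


def code_points(text):
    """Return the code points of text as a space-separated U+XXXX list."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(len(nfc), code_points(nfc))
print(len(nfd), code_points(nfd))
```

The NFC string has 6 code points and the NFD string has 8, because each é becomes the pair U+0065 U+0301.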

In the following XML document the first <résumé> element is expressed using NFC. The second is expressed using NFD:

	<?xml version="1.0" encoding="UTF-8"?>
	<Test>
	        <résumé>____</résumé>
	        <résumé>____</résumé>
	</Test>

The two <résumé> elements appear the same, don’t they? That’s a neat thing about NFC and NFD -- visualization tools display them the same way.
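
Although the two names render identically, a parser sees two different names. The sketch below (Python, standard library only) parses a document like the one above and compares the element names before and after normalization. The variable names are my own, and the element content is a placeholder.

```python
import unicodedata
import xml.etree.ElementTree as ET

NFC_NAME = "r\u00e9sum\u00e9"      # composed: U+00E9
NFD_NAME = "re\u0301sume\u0301"    # decomposed: U+0065 U+0301

# Build the test document with one element in each form.
doc = (
    f"<Test><{NFC_NAME}>a</{NFC_NAME}>"
    f"<{NFD_NAME}>b</{NFD_NAME}></Test>"
)

root = ET.fromstring(doc.encode("utf-8"))
tags = [child.tag for child in root]

# Compared code point by code point, the names differ ...
raw_equal = tags[0] == tags[1]

# ... but they are equal once both are brought to the same form.
normalized_equal = (
    unicodedata.normalize("NFC", tags[0])
    == unicodedata.normalize("NFC", tags[1])
)

print(raw_equal, normalized_equal)
```

This is why a schema that declares only one form will reject instances that use the other.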

To let users express accented element and attribute names in either NFC or NFD, design your XML Schemas using an <xs:choice> element. In the following XSD snippet the first résumé is NFC and the second is NFD:

            <xs:choice>
                <xs:element name="résumé" type="xs:string" />
                <xs:element name="résumé" type="xs:string" />
            </xs:choice>

By designing your schemas in this fashion you empower your instance document authors to use whatever normalization form they prefer (or their tools prefer).
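
Typing both forms by hand is error-prone since they look the same in an editor, so one option is to generate the declarations. This is only a sketch: choice_for is a hypothetical helper of my own, not part of any schema tool, and it emits the same xs:choice shape as the snippet above.

```python
import unicodedata
from xml.sax.saxutils import quoteattr


def choice_for(name, xsd_type="xs:string"):
    """Hypothetical helper: emit an xs:choice offering name in NFC and NFD."""
    forms = []
    for form in ("NFC", "NFD"):
        variant = unicodedata.normalize(form, name)
        if variant not in forms:  # unaccented names yield a single form
            forms.append(variant)

    elements = [
        f"    <xs:element name={quoteattr(v)} type={quoteattr(xsd_type)} />"
        for v in forms
    ]
    if len(elements) == 1:
        return elements[0].strip()
    return "\n".join(["<xs:choice>", *elements, "</xs:choice>"])


snippet = choice_for("r\u00e9sum\u00e9")
print(snippet)
```

For a name with no accents, such as Test, the two forms coincide and the helper returns a single xs:element with no xs:choice wrapper.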

I inquired on the Unicode mailing list about NFD. Here are my notes on their responses:

Most text exchanged on the Internet is NFC-encoded. However, you can't count on text to always be NFC-encoded. In fact, there are definite advantages to NFD-encoding text.

Some operating systems store filenames in NFD encoding.

It’s easier to remember a handful of useful composing accents than the much larger number of combined forms.

NFD makes the regular expressions used to qualify its contents much, *much* simpler.  I imagine that things like fuzzy text matching are easier in NFD.
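
A small illustration of the regular expression point, as I understand it: in NFD the accents are separate characters, so one character class covers them all. This Python sketch only handles the Combining Diacritical Marks block (U+0300 to U+036F), which is an assumption that is good enough for this example but not for every script. The name strip_accents is my own.

```python
import re
import unicodedata

# Matches marks in the Combining Diacritical Marks block only.
COMBINING_MARKS = re.compile(r"[\u0300-\u036f]")


def strip_accents(text):
    """Decompose to NFD, then drop the combining marks."""
    return COMBINING_MARKS.sub("", unicodedata.normalize("NFD", text))


# Both spellings of the name reduce to the same unaccented key.
from_nfc = strip_accents("r\u00e9sum\u00e9")
from_nfd = strip_accents("re\u0301sume\u0301")
print(from_nfc, from_nfd)
```

Doing the same on NFC text would need a lookup table with an entry for every precomposed letter.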

There are well-documented cases of, for example, keyboards that generate de-normalized sequences, file systems that use other forms, and tools which generate content that is not normalized. This content enters the Web in a non-NFC state.

It is easier to use a few keystrokes for combining accents than to set up compose key sequences for all the possible composed characters.

It’s easier to do searches and other text processing on NFD-encoded text.

Some Unicode-defined processes, such as capitalization, are not guaranteed to preserve normalization forms. So the result of converting a lowercase character in NFC may be a decomposed uppercase character sequence (i.e., NFD).
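
Here is one case I could reproduce, assuming Python 3, whose str.upper() applies the Unicode full case mappings. U+0390 (GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS) is a single NFC character, but it has no precomposed uppercase form, so uppercasing yields a three-character decomposed sequence.

```python
import unicodedata

lower = "\u0390"          # a single code point, already in NFC
upper = lower.upper()     # full case mapping: U+0399 U+0308 U+0301

lower_is_nfc = unicodedata.normalize("NFC", lower) == lower
upper_is_nfc = unicodedata.normalize("NFC", upper) == upper
upper_is_nfd = unicodedata.normalize("NFD", upper) == upper

print([f"U+{ord(ch):04X}" for ch in upper])
print(lower_is_nfc, upper_is_nfc, upper_is_nfd)
```

The input is in NFC, but the uppercased result is in NFD and not in NFC, so it would have to be normalized again before being compared with NFC text.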

Thoughts?

/Roger

    I have approximate answers and possible beliefs 
    in different degrees of certainty about different 
    things, but I'm not absolutely sure of anything.

                                             Richard Feynman

