XML-DEV Mailing List Archive

Subject: Best Practice for designing XML vocabularies containing accented characters
Hi Folks,

I propose the following as Best Practice:

    For elements and attributes that have accents, allow users to express
    them in either composed normalized form (NFC) or decomposed normalized
    form (NFD).

Example: suppose that your XML vocabulary is to contain this element:

    <résumé>

Notice the two accented characters. There are two standard, canonical ways to express those accented characters:

1. Normalization Form Composed (NFC): the accented character is expressed as a single composed character (U+00E9 LATIN SMALL LETTER E WITH ACUTE).

2. Normalization Form Decomposed (NFD): the accented character is expressed as a decomposed sequence of two characters (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT).

In the following XML document the first <résumé> element is expressed using NFC. The second is expressed using NFD:

    <?xml version="1.0" encoding="UTF-8"?>
    <Test>
        <résumé>____</résumé>
        <résumé>____</résumé>
    </Test>

The two <résumé> elements appear the same, don't they? That's a neat thing about NFC and NFD -- visualization tools display them the same way.

In order for users to express accented elements and attributes in either NFC or NFD, design your XML Schemas using an <xs:choice> element. In the following XSD snippet the first résumé is NFC and the second is NFD:

    <xs:choice>
        <xs:element name="résumé" type="xs:string" />
        <xs:element name="résumé" type="xs:string" />
    </xs:choice>

By designing your schemas in this fashion you empower your instance document authors to use whichever normalization form they prefer (or their tools prefer).

I inquired on the Unicode mailing list about NFD. Here are my notes on their responses:

- Most text exchanged on the Internet is NFC-encoded. However, you can't count on text always being NFC-encoded. In fact, there are definite advantages to NFD-encoding text.

- Some operating systems store filenames in NFD encoding.

- It's easier to remember a handful of useful combining accents than the much larger number of composed forms.
- NFD makes the regular expressions used to qualify its contents much, *much* simpler.

- I imagine that things like fuzzy text matching are easier in NFD.

- There are well-documented cases of, for example, keyboards that generate de-normalized sequences, file systems that use other forms, and tools that generate content that is not normalized. This content enters the Web in a non-NFC state.

- It is easier to use a few keystrokes for combining accents than to set up compose-key sequences for all the possible composed characters.

- It's easier to do searches and other text processing on NFD-encoded text.

- Some Unicode-defined processes, such as capitalization, are not guaranteed to preserve normalization forms. So the result of uppercasing a lowercase character in NFC may be a decomposed uppercase character sequence (i.e., NFD).

Thoughts?

/Roger

"I have approximate answers and possible beliefs in different degrees of certainty about different things, but I'm not absolutely sure of anything." -- Richard Feynman
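The points above are easy to verify with a short Python sketch using only the standard library (the variable names and the tiny test document are just for illustration, not anything from the original post). It shows that the NFC and NFD spellings of "résumé" display identically but are different code point sequences, that unicodedata.normalize converts between them, and that an XML parser consequently treats them as two different element names -- which is the motivation for the <xs:choice> design above.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The same word in the two canonical normalization forms.
nfc = "r\u00e9sum\u00e9"        # NFC: "é" is the single character U+00E9
nfd = "re\u0301sume\u0301"      # NFD: "é" is U+0065 followed by U+0301

print(nfc, nfd)                 # both display as "résumé"
print(nfc == nfd)               # False -- different code point sequences
print(len(nfc), len(nfd))       # 6 vs. 8 code points

# unicodedata.normalize converts between the two forms.
print(unicodedata.normalize("NFC", nfd) == nfc)   # True
print(unicodedata.normalize("NFD", nfc) == nfd)   # True

# An XML parser compares names code point by code point, so an NFC
# <résumé> and an NFD <résumé> are two *different* element names.
doc = "<Test><r\u00e9sum\u00e9/><re\u0301sume\u0301/></Test>"
root = ET.fromstring(doc)
print(root[0].tag == root[1].tag)   # False
```

Note that combining characters such as U+0301 are legal XML name characters, which is why the NFD spelling parses without error even though it is a distinct name.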