UTF's considered best practice [was: Re: nextml]

From: Jim DeLaHunt <from.xml-dev@jdlh.com>
To: Amelia A Lewis <amyzing@talsever.com>, Uche Ogbuji <uche@o...>
Date: Wed, 8 Dec 2010 22:31:02 -0800

Amy, and all:

At 10:18 PM -0700 12/8/10, Uche Ogbuji wrote:
>On Wed, Dec 8, 2010 at 9:27 PM, Amelia A Lewis <amyzing@talsever.com> wrote:
>
>  >I've
>  >seen a number of "only UTF" comments, and I think that they're rather
>  >western-centric, so I'm thinking "no," there (if someone whose native
>  >language *isn't* west european proposes it, I might rethink)
>
>
>Rick Jelliffe brings one of the most complete and coherent
>Eastern/Western perspectives I've ever encountered, and his proposal
>says:
>
>"A Nuke document is UTF-8 in its external form. Inside a program,
>after parsing, it would typically use UTF16."
>
>Yes, we all know about the politics and inertia that have affected
>uptake of Unicode in some geographies, but the "UTF-8 or UTF-16" is
>there for a very strong pragmatic reason.  Dealing with a pretty
>open-ended world of character sets, as in XML 1.0 is one of the
>biggest factors that complicate and slow down parsers, even if you
>get someone else (e.g. ICU) to do the relatively hard bits....

I don't know much about XML (which is why I lurk here and learn), but 
I do know something about internationalisation.  Amy, I applaud your 
caution against western-centric limitations to any nextml.  I'm with 
Uche in saying that limiting any nextml proposal to the Unicode 
Transformation Formats (UTF-8, UTF-16BE, UTF-16LE) is good 
internationalisation, not western-centric.  In contrast, any other 
text encoding will lock out some language or other.

Best internationalisation practice is to process text in Unicode: 
convert into a Unicode format on input, and convert back (if needed) 
on output.  I'm a regular attendee at the Internationalisation and 
Unicode Conferences, and this is the consistent recommendation. See:

"Handling character encodings in HTML and CSS"
<http://www.w3.org/International/tutorials/tutorial-char-enc/>

"Unicode nearing 50% of the web"
Key quote: "[Google has] long used Unicode as the internal format for 
all the text we search: any other encoding is first converted to 
Unicode for processing."
<http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html> 
(2010/01/28)
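
To make the "convert on input, process in Unicode, convert on output" 
pattern concrete, here is a minimal sketch in Python. The function 
name, the sample string, and the choice of Shift_JIS as the legacy 
encoding are illustrative assumptions of mine, not part of any nextml 
proposal.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Hypothetical helper: legacy-encoded bytes in, UTF-8 bytes out."""
    # Input boundary: decode the legacy bytes into Unicode text.
    text = raw.decode(source_encoding)
    # All processing happens on Unicode text (here, trimming whitespace).
    text = text.strip()
    # Output boundary: serialise the Unicode text as UTF-8.
    return text.encode("utf-8")


# A generator that holds Shift_JIS data transcodes before emitting it.
legacy_bytes = " 日本語 ".encode("shift_jis")
utf8_bytes = transcode_to_utf8(legacy_bytes, "shift_jis")
print(utf8_bytes.decode("utf-8"))
```

The parser on the receiving end then only ever needs a UTF-8 decoder; 
knowledge of the legacy encoding stays with whoever produced the data.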

For nextml, I think it's fine to limit document encodings to UTF-8 
only, or UTF-8 plus UTF-16.  Let the generators and consumers 
transcode to other character sets if they think it important.  Ten 
years ago that wasn't a reasonable stance to take; documents encoded 
in Unicode were rare.  But now, more than 50% of the web is in 
Unicode:
<http://twitter.com/mark_e_davis/statuses/22673110887> (2010/08/31)
[Mark Davis is Internationalization Architect for Google, and 
President of the Unicode Consortium. He knows his stuff.]

Sometimes UTF-16 is a more compact representation, sometimes UTF-8 
is. It depends on the frequency distribution of characters in the 
document. But they have equivalent descriptive power; either can 
represent any sequence of Unicode characters.  If nextml adopts 
UTF-16, be aware that it can be serialised to bytes in either 
little-endian or big-endian order (UTF-16LE or UTF-16BE), so nextml 
should account for those possibilities. It should also allow for the 
special Byte-Order Mark character (BOM), which is used to distinguish 
the two.
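
As a rough illustration of both points, here is a small Python sketch. 
The sample strings and the function names are my own assumptions for 
the example. The BOM check covers only UTF-16 and does not try to rule 
out UTF-32, whose little-endian BOM begins with the same two bytes.

```python
import codecs


def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def sniff_utf16_byte_order(data: bytes):
    """Hypothetical helper: name the UTF-16 byte order from a leading BOM.

    Returns None when no UTF-16 BOM is present.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    return None


english = "Hello, world"      # 12 ASCII characters
japanese = "こんにちは世界"      # 7 characters, all in the Basic Multilingual Plane

print(encoded_sizes(english))    # (12, 24): UTF-8 is smaller
print(encoded_sizes(japanese))   # (21, 14): UTF-16 is smaller

big_endian_doc = codecs.BOM_UTF16_BE + "<doc/>".encode("utf-16-be")
print(sniff_utf16_byte_order(big_endian_doc))   # utf-16-be
```

ASCII-heavy text (which includes most markup) favours UTF-8, while 
text dominated by CJK characters favours UTF-16.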

See also:
"Benefits of the Unicode Character Standard" 
<http://www.i18nguy.com/UnicodeBenefits.html>

"Unicode in XML and other Markup Languages" 
<http://www.unicode.org/reports/tr20/>
<http://www.w3.org/TR/unicode-xml/>

"Best Practices for XML Internationalization" 
<http://www.w3.org/TR/xml-i18n-bp/>

So, even though my native language is western european, I hope you'll 
reconsider and say "yes" to UTF-8 and/or UTF-16 only for nextml.

At 10:18 PM -0700 12/8/10, Uche Ogbuji continued:
...
>If we want to have a strong diversity of well-performing and
>conforming tools, which I suspect is an important component of
>success for most of us considering XML-NG, I think "UTF-*-only" is
>the simple reality.  For me, UTF-8 or UTF-16 is certainly an
>improvement over JSON's UTF-8 only.
>
>I'm curious as to how that JSON limitation is affecting trends in
>text processing conventions in non-Western countries as "Web 2.0"
>becomes pervasive.

-- 
     --Jim DeLaHunt, jdlh@jdlh.com     http://blog.jdlh.com/ (http://jdlh.com/)
       multilingual websites consultant

       157-2906 West Broadway, Vancouver BC V6K 2G8, Canada
          Canada mobile +1-604-376-8953

References:
- nextml
  - From: Amelia A Lewis <amyzing@talsever.com>
- Re: nextml
  - From: Uche Ogbuji <uche@ogbuji.net>
