UTF's considered best practice [was: Re: nextml]
Amy, and all:

At 10:18 PM -0700 12/8/10, Uche Ogbuji wrote:
>On Wed, Dec 8, 2010 at 9:27 PM, Amelia A Lewis <amyzing@talsever.com> wrote:
>
> >I've
> >seen a number of "only UTF" comments, and I think that they're rather
> >western-centric, so I'm thinking "no," there (if someone whose native
> >language *isn't* west european proposes it, I might rethink)
>
>
>Rick Jelliffe brings one of the most complete and coherent
>Eastern/Western perspectives I've ever encountered, and his proposal
>says:
>
>"A Nuke document is UTF-8 in its external form. Inside a program,
>after parsing, it would typically use UTF16."
>
>Yes, we all know about the politics and inertia that have affected
>uptake of Unicode in some geographies, but the "UTF-8 or UTF-16" is
>there for a very strong pragmatic reason. Dealing with a pretty
>open-ended world of character sets, as in XML 1.0 is one of the
>biggest factors that complicate and slow down parsers, even if you
>get someone else (e.g. ICU) to do the relatively hard bits....

I don't know much about XML (which is why I lurk here and learn), but I do know something about internationalisation.

Amy, I applaud your caution against western-centric limitations in any nextml. But I'm with Uche in saying that limiting a nextml proposal to the Unicode Transformation Formats (UTF-8, UTF-16BE, UTF-16LE) is good internationalisation, not western-centrism. In contrast, any legacy non-Unicode encoding will lock out some language or other.

Best internationalisation practice is to convert text into a Unicode format on input, process it as Unicode, and convert back (if needed) on output. I'm a regular attendee at the Internationalisation and Unicode Conferences, and this is the consistent recommendation there.
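To make that recommendation concrete, here is a minimal Python sketch of the "convert at the edges, process in Unicode" pattern. It is only an illustration under my own assumptions: the function names (`to_unicode`, `from_unicode`, `can_encode`) and the sample strings are hypothetical, not part of any nextml proposal.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Input boundary: decode bytes in a known encoding into Unicode text."""
    return raw.decode(source_encoding)


def from_unicode(text: str, target_encoding: str = "utf-8") -> bytes:
    """Output boundary: serialise Unicode text, UTF-8 by default."""
    return text.encode(target_encoding)


def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# A document arrives in a legacy encoding (Shift_JIS here, as an example).
legacy_bytes = "日本語".encode("shift_jis")

# Convert on input, do all processing on Unicode text, convert on output.
text = to_unicode(legacy_bytes, "shift_jis")
utf8_bytes = from_unicode(text)

# Each UTF can carry mixed-script text; a single-script legacy
# encoding such as Latin-1 cannot.
mixed = "naïve 日本語 Ελληνικά"
utf_support = {enc: can_encode(mixed, enc)
               for enc in ("utf-8", "utf-16-be", "utf-16-le")}
latin1_support = can_encode(mixed, "latin-1")
```

The point of the last few lines is the "lock out" argument: the three UTFs all accept the mixed-script sample, while Latin-1 rejects it because it has no code points for Japanese or Greek.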
See:
"Handling character encodings in HTML and CSS"
<http://www.w3.org/International/tutorials/tutorial-char-enc/>

"Unicode nearing 50% of the web"
Key quote: "[Google has] long used Unicode as the internal format for all the text we search: any other encoding is first converted to Unicode for processing."
<http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html> (2010/01/28)

For nextml, I think it's fine to limit document encodings to UTF-8 only, or to UTF-8 plus UTF-16. Let the generators and consumers transcode to other character sets if they think it important. Ten years ago that wasn't a reasonable stance to take; documents encoded in Unicode were rare. But now, more than 50% of the web is in Unicode:
<http://twitter.com/mark_e_davis/statuses/22673110887> (2010/08/31)
[Mark Davis is Internationalization Architect for Google, and President of the Unicode Consortium. He knows his stuff.]

Sometimes UTF-16 is the more compact representation, sometimes UTF-8 is. It depends on the frequency distribution of characters in the document. But the two have equivalent descriptive power; either can represent any sequence of Unicode characters.

If nextml adopts UTF-16, be aware that it can be serialised to bytes in either little-endian or big-endian order (UTF-16LE or UTF-16BE), so nextml should account for both possibilities. It should also allow for the special Byte-Order Mark character (BOM), which is used to distinguish the two.

See also:
"Benefits of the Unicode Character Standard"
<http://www.i18nguy.com/UnicodeBenefits.html>

"Unicode in XML and other Markup Languages"
<http://www.unicode.org/reports/tr20/>
<http://www.w3.org/TR/unicode-xml/>

"Best Practices for XML Internationalization"
<http://www.w3.org/TR/xml-i18n-bp/>

So, even though my native language is western european, I hope you'll reconsider and say "yes" to UTF-8 and/or UTF-16 only for nextml.

At 10:18 PM -0700 12/8/10, Uche Ogbuji continued:
...
>If we want to have a strong diversity of well-performing and
>conforming tools, which I suspect is an important component of
>success for most of us considering XML-NG, I think "UTF-*-only" is
>the simple reality. For me, UTF-8 or UTF-16 is certainly an
>improvement over JSON's UTF-8 only.
>
>I'm curious as to how that JSON limitation is affecting trends in
>text processing conventions in non-Western countries as "Web 2.0"
>becomes pervasive.

--
--Jim DeLaHunt, jdlh@jdlh.com http://blog.jdlh.com/ (http://jdlh.com/)
multilingual websites consultant
157-2906 West Broadway, Vancouver BC V6K 2G8, Canada
Canada mobile +1-604-376-8953