
Re: Microsoft FUD on binary XML...


Tony Graham wrote:

> Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
> Chinese characters in the Basic Multilingual Plane (i.e., most of the
> Chinese characters in the message) since as UTF-16, one Chinese
> character is 16 bits, and as UTF-8, one Chinese character is three
> bytes.

Exactly - efficient representation of Unicode text currently, sadly,
requires the user or the application to do a frequency analysis and
decide whether to use UTF-8 or UTF-16... I think very, very few do
this right now; UTF-8 seems the almost ubiquitous choice, mainly because
the software industry is driven from places that use the Roman alphabet.
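
For what it's worth, the crudest form of that "frequency analysis" is just
to encode the text both ways and keep the shorter result. A minimal sketch
in Python, where the function name smaller_encoding is made up for
illustration and the byte-order mark is deliberately left out of the count:

```python
def smaller_encoding(text: str) -> str:
    """Return the name of whichever of UTF-8 / UTF-16 is shorter for `text`.

    Hypothetical helper, not part of any library.
    """
    utf8_size = len(text.encode("utf-8"))
    # "utf-16-be" omits the 2-byte byte-order mark that plain "utf-16" adds,
    # so this compares payload sizes only.
    utf16_size = len(text.encode("utf-16-be"))
    # Ties go to UTF-8, which is an arbitrary choice for this sketch.
    return "utf-8" if utf8_size <= utf16_size else "utf-16"
```

Mostly-ASCII markup tips the answer towards UTF-8 even when the character
data is Chinese, which is why the decision really has to be made per
document rather than per language.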

Perhaps we need a new UTF that loses many of UTF-8's nice properties with 
respect to lexical sorting and so on, but is less discriminatory against 
character sets that live far into the BMP, perhaps working along the 
lines of:

Code points 0..127 represented as-is.

Code points 128+ represented by switching mode; to start a sequence of 
up to 128 wide characters (UTF-16 code units), output a header byte with 
the value 128 + (length-1), then that many UTF-16 code units (in network 
byte order, i.e. big-endian).

Plus some canonicalisation requirements: the encoder must not emit two 
sequences of wide characters next to each other unless the first one 
is 128 characters long (so there is no choice in how you split up blocks 
of more than 128 wide characters; you must output sequences of 128 
characters until there are fewer than 128 left).
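
To make the rules above concrete, here is a rough Python sketch of an
encoder and decoder. The names encode_mixed and decode_mixed are invented
for this example. It also makes two assumptions that the description above
does not settle: code points beyond the BMP are written as UTF-16
surrogate pairs, and a pair may straddle a 128-unit block boundary.

```python
def encode_mixed(text: str) -> bytes:
    """Encode `text` under the scheme sketched above (hypothetical)."""
    out = bytearray()
    run = []  # pending UTF-16 code units for the current wide run

    def flush() -> None:
        # Emit the pending run as full 128-unit blocks plus one shorter
        # tail block, which is the canonical split described above.
        for start in range(0, len(run), 128):
            block = run[start:start + 128]
            out.append(128 + (len(block) - 1))  # header byte: 128..255
            for unit in block:
                out.extend(unit.to_bytes(2, "big"))  # network byte order
        run.clear()

    for ch in text:
        if ord(ch) < 128:
            flush()
            out.append(ord(ch))  # code points 0..127 as-is
        else:
            raw = ch.encode("utf-16-be")  # 2 bytes, or 4 for a surrogate pair
            for i in range(0, len(raw), 2):
                run.append(int.from_bytes(raw[i:i + 2], "big"))
    flush()
    return bytes(out)


def decode_mixed(data: bytes) -> str:
    """Inverse of encode_mixed; assumes well-formed input."""
    units = bytearray()  # rebuilt as one big-endian UTF-16 stream
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b < 128:
            units += bytes([0, b])  # ASCII byte widened to a UTF-16 unit
        else:
            count = b - 127  # header stores length - 1
            units += data[i:i + 2 * count]
            i += 2 * count
    return units.decode("utf-16-be")
```

For example, encode_mixed("£5") gives the four bytes 0x80 0x00 0xA3 0x35:
a header for a run of one wide character, the pound sign as UTF-16, and
then the plain ASCII digit.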

That way text entirely outside the 0..127 range would only be 
penalised by one extra byte per 256 bytes (128 characters). Pure US-ASCII 
would still come out as pure US-ASCII, so it'd be readable in legacy viewers.

People who use pound signs and accented characters, like us Europeans, 
would see each such symbol taking 3 bytes (one header byte plus two 
bytes of UTF-16), but they currently take 2 bytes in UTF-8 and occur 
only occasionally interspersed with US-ASCII characters anyway, so the 
hit would be nowhere near as bad as the hit UTF-8 incurs for the Chinese 
and their neighbours.
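
A quick way to check those byte counts without writing a full encoder is
to count runs. This is only a sketch: proposed_size is a made-up name, and
it assumes code points beyond the BMP cost two UTF-16 code units each.

```python
import re


def proposed_size(text: str) -> int:
    """Byte count of `text` under the scheme sketched above (hypothetical)."""
    total = 0
    # Split into maximal runs of ASCII and of non-ASCII characters.
    for run in re.findall(r"[\x00-\x7f]+|[^\x00-\x7f]+", text):
        if ord(run[0]) < 128:
            total += len(run)  # one byte per ASCII character
        else:
            units = len(run.encode("utf-16-be")) // 2
            headers = -(-units // 128)  # ceiling division: one per block
            total += headers + 2 * units
    return total


def compare(text: str) -> dict:
    """Sizes in bytes under UTF-8, UTF-16 (no BOM) and the sketch."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "proposed": proposed_size(text),
    }
```

On "café" this gives 5 bytes for UTF-8 against 6 for the proposed scheme,
while 128 Chinese characters from the BMP come to 384 bytes in UTF-8
against 257.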

> 
> Regards,
> 
> 
> Tony Graham

ABS

