XML-DEV Mailing List Archive
Re: Microsoft FUD on binary XML...
Tony Graham wrote:
> Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
> Chinese characters in the Basic Multilingual Plane (i.e., most of the
> Chinese characters in the message) since as UTF-16, one Chinese
> character is 16 bits, and as UTF-8, one Chinese character is three
> bytes.

Exactly - efficient representation of Unicode text currently, sadly, involves the user or the application doing a frequency analysis and deciding whether to use UTF-8 or UTF-16... I think very, very few do this right now; UTF-8 seems the almost ubiquitous choice, mainly due to the software industry being driven from places that use the Roman alphabet.

Perhaps we need a new UTF that loses many of UTF-8's nice properties with respect to lexical sorting and so on, but is less discriminatory against character sets that live far into the BMP, perhaps working along the lines of:

  Code points 0..127 are represented as-is.

  Code points 128+ are represented by switching mode; to start a sequence of up to 128 wide characters, output a byte consisting of 128 + (length-1), then that many UTF-16 characters (in network byte order).

Plus some canonicalisation requirements, like the system must not have two sequences of wide characters next to each other unless the first one is 128 characters long (so there is no choice in how you split up blocks of more than 128 wide characters; you must output sequences of 128 characters until there are fewer than 128 left).

That way text that was all out of the 0..127 range would only be penalised by an extra byte per 256 bytes (128 characters). Pure US-ASCII would still come out as pure US-ASCII, so it'd be readable in legacy viewers. People who use pound signs and accented characters, like us Europeans, would see each such symbol taking 3 bytes (one length byte plus one 16-bit character), but they currently take 2 bytes in UTF-8 and occur only occasionally interspersed with US-ASCII characters anyway, so the hit would be nowhere near as bad as the hit UTF-8 incurs for the Chinese and their neighbours.
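To make the scheme described above a bit more concrete, here is a rough Python sketch of it. The names `encode` and `decode` are made up for illustration, and a few details are assumptions rather than part of the proposal: "wide characters" are counted as UTF-16 code units, code points outside the BMP go in as surrogate pairs, and the decoder does not reject non-canonical input.

```python
def encode(text: str) -> bytes:
    """Encode text using the mode-switching scheme sketched in this post."""
    out = bytearray()
    run = []  # pending UTF-16 code units for the current wide run

    def flush():
        # Canonical splitting: full blocks of 128 units, then the remainder.
        for i in range(0, len(run), 128):
            block = run[i:i + 128]
            out.append(128 + (len(block) - 1))  # header byte: 128 + (length-1)
            for unit in block:
                out.extend(unit.to_bytes(2, "big"))  # network byte order
        run.clear()

    for ch in text:
        if ord(ch) < 128:
            if run:
                flush()
            out.append(ord(ch))  # 0..127 passes through as-is
        else:
            # Assumption: non-BMP code points become UTF-16 surrogate pairs.
            raw = ch.encode("utf-16-be")
            for i in range(0, len(raw), 2):
                run.append(int.from_bytes(raw[i:i + 2], "big"))
    if run:
        flush()
    return bytes(out)


def decode(data: bytes) -> str:
    """Inverse of encode(); does not enforce the canonicalisation rule."""
    out = []
    wide = bytearray()  # UTF-16 bytes gathered from adjacent wide blocks
    i = 0
    while i < len(data):
        b = data[i]
        if b < 128:
            if wide:
                out.append(wide.decode("utf-16-be"))
                wide.clear()
            out.append(chr(b))
            i += 1
        else:
            n = b - 128 + 1  # number of 16-bit units in this block
            chunk = data[i + 1:i + 1 + 2 * n]
            if len(chunk) != 2 * n:
                raise ValueError("truncated wide-character block")
            wide.extend(chunk)
            i += 1 + 2 * n
    if wide:
        out.append(wide.decode("utf-16-be"))
    return "".join(out)
```

With this sketch, 128 copies of a BMP Chinese character such as U+4E2D come out as 257 bytes, against 256 in UTF-16 and 384 in UTF-8, while a lone pound sign takes 3 bytes against 2 in UTF-8, which matches the arithmetic above.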
> Regards,
>
> Tony Graham

ABS