[XML-DEV Mailing List Archive] Re: Microsoft FUD on binary XML...
Alaric B Snell <alaric@a...> wrote at Fri, 21 Nov 2003 11:14:14 +0000:
> Rick Jelliffe wrote:
>
> > Also, it would be interesting to see binary people use Chinese
> > (Japanese or Korean) text and markup for their test data.
> > Compressing or packing ASCII is quite different to compressing or
> > packing UTF-16 Chinese, which has a more random-seeming distribution
> > of byte values. It is not dishonest to make the case for binary
> > using data that is most compressible; but businesses who are looking
> > at compression strategies for world-wide use need to factor in CJK
> > compressibility into their evaluations.
>
> That only makes a difference if you're actually compressing the text
> fields - most binary interchange formats will just write the text in
> UTF-8 and leave it at that; however lower-level byte sequence

Changing UTF-16 Chinese to UTF-8 means a 50% size increase for the
Chinese characters in the Basic Multilingual Plane (i.e., most of the
Chinese characters in the message), since as UTF-16 one such character
is two bytes (16 bits), and as UTF-8 it is three bytes.

Only characters in the ASCII range take less space as UTF-8 than as
UTF-16. It's 1:1 for U+0080 to U+07FF (two bytes either way) and for
U+10000 and above (four bytes either way), but for U+0800 to U+FFFF
(excluding the surrogate range U+D800 to U+DFFF), which includes the
most frequently used Chinese, Japanese, and Korean characters, UTF-8
uses three bytes where UTF-16 uses two.

> compressors will just see the text as bytes rather than as characters.
> I've yet to see an implementation of the deflate algorithm (as used by
> gzip) for UCS-4 codepoints rather than just bytes, but it could be done
> and would be very interesting (but if you use a wide range of characters
> in the input, your Huffman tree will be a bit memory-intensive! :-)

Regards,

Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708
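
The byte counts in the reply above can be checked with a short script.
This is only a minimal sketch, not something from the original thread:
the helper name encoded_sizes and the sample characters are chosen here
for illustration, and UTF-16 is measured without a byte order mark.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text.

    'utf-16-le' is used so that no byte order mark is counted.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# One sample character per code point range discussed in the message.
# The specific characters are illustrative choices.
samples = [
    ("U+0000..U+007F (ASCII)", "A"),
    ("U+0080..U+07FF", "\u00e9"),
    ("U+0800..U+FFFF (BMP CJK)", "\u4e2d"),
    ("U+10000 and above", "\U00020000"),
]

for label, ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

For text made up only of BMP CJK characters, the UTF-8 total is 1.5
times the UTF-16 total, which is the 50% increase mentioned in the
message.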