XML-DEV Mailing List Archive
Re: Microsoft FUD on binary XML...
At 1:36 PM +0000 11/21/03, Alaric B Snell wrote:
>People who use pound signs and accented characters, like us
>Europeans, would see each such symbol taking 3 bytes, but they
>currently take 2 bytes in UTF-8 and occur only occasionally
>interspersed with US-ASCII characters anyway, so the hit would be
>nowhere near as bad as the hit UTF-8 incurs for the Chinese and
>their neighbours.
>

One should keep in mind that Chinese and similar languages are quite compressed to start with, far more so than English text is. For example, in UTF-8 the English word "tree" takes four bytes. The Japanese word for tree takes three bytes. The English word "grove" takes five bytes. The Japanese word for grove takes three bytes. The English word "forest" takes six bytes. The Japanese word for forest still takes only three bytes. I don't know the Japanese word for antidisestablishmentarianism, but whatever it is, it's probably a lot smaller than the English one.

Comparing alphabetic languages to ideographic ones is really apples to oranges. Word for word, Chinese documents tend to be smaller, even in UTF-8.

--
Elliotte Rusty Harold
elharo@m...
Effective XML (Addison-Wesley, 2003)
http://www.cafeconleche.org/books/effectivexml
http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA
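The byte counts quoted in the message can be checked with a few lines of Python. This is only a rough sketch: the message does not say which Japanese words are meant, so the single-kanji words 木 (tree), 林 (grove) and 森 (forest) are an assumption here, and the names `WORDS`, `utf8_len` and `byte_counts` are made up for illustration.

```python
# Word pairs to compare. The kanji are an assumption: the original
# message names the English words but not the Japanese ones.
WORDS = [("tree", "木"), ("grove", "林"), ("forest", "森")]


def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def byte_counts(pairs):
    """Return (english, english_bytes, japanese, japanese_bytes) per pair."""
    return [(en, utf8_len(en), ja, utf8_len(ja)) for en, ja in pairs]


if __name__ == "__main__":
    for en, en_bytes, ja, ja_bytes in byte_counts(WORDS):
        print(f"{en}: {en_bytes} bytes, {ja}: {ja_bytes} bytes")
```

Each ASCII letter costs one byte in UTF-8, while each of these kanji costs three, so the English words grow with their length and the assumed Japanese words stay at three bytes apiece.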