RE: An XML document is not well-formed if encoding="..." does
David:

[re-send, including the xml-dev list]

At 2:53 AM +0000 12/30/12, David Lee wrote:
>For people who use languages which have predominantly non-latin codepoints ...
>Is UTF8 actually worse than UTF32 - file size wise ?

No, I believe not. It follows from the definitions of UTF-8 and UTF-32 that there is no sequence of Unicode character values for which the UTF-8 representation requires more bytes than the UTF-32 representation: UTF-8 uses at most four bytes per code point, and UTF-32 always uses exactly four. On the contrary, in all but pathological cases the UTF-8 representation will require fewer bytes.

The best answer to the Stack Overflow question, "at all times text encoded in UTF-8 will never give us more than a +50% file size of the same text encoded in UTF-16. true / false?",
http://stackoverflow.com/questions/6883434/at-all-times-text-encoded-in-utf-8-will-never-give-us-more-than-a-50-file-size ,
has a case study comparing the number of characters and UTF-8 bytes for the text content of several language versions of the Wikipedia "Tokyo" article. Extending the results table there a bit, we see that the ratio of bytes-for-UTF-8 / bytes-for-UTF-32 ranged from a high of 65% (for Japanese) to a low of 26% (for English, Spanish, and French).

While we're at it, note that the ratio of bytes-for-UTF-8 / bytes-for-UTF-16 ranged from a high of 129% (again for Japanese) to a low of 51% (for English). In fact, Japanese, Korean and Simplified Chinese were the only languages in the sample where UTF-8 took more bytes than UTF-16. For Traditional Chinese and all other languages in the sample, UTF-8 was more compact.

>And does it matter much ?

I would say, with just a little bit of snark, that anyone choosing to mark up their document with an XML language has already declared they don't care much about file size being bloated. :-)

But there are other factors in choosing a Unicode Transformation Format (UTF) to represent text. For some applications, UTF-32's 1:1 mapping of code unit to character might be valuable.
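The size comparisons above can be checked with a short sketch. This is only an illustration, not the method used in the Stack Overflow case study: the helper name `encoded_sizes` and the two sample sentences are my own assumptions, and byte counts exclude any byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in UTF-8, UTF-16 and UTF-32.

    The explicit little-endian codecs are used so that Python does not
    prepend a byte order mark, which would skew short samples.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Hypothetical sample sentences, NOT the Wikipedia "Tokyo" article text.
samples = {
    "English": "Tokyo is the capital of Japan.",
    "Japanese": "東京は日本の首都です。",
}

for language, text in samples.items():
    sizes = encoded_sizes(text)
    # Ratios comparable to the bytes-for-UTF-8 / bytes-for-UTF-N figures.
    vs_utf16 = sizes["utf-8"] / sizes["utf-16"]
    vs_utf32 = sizes["utf-8"] / sizes["utf-32"]
    print(f"{language}: {sizes}  "
          f"UTF-8/UTF-16 = {vs_utf16:.0%}, UTF-8/UTF-32 = {vs_utf32:.0%}")
```

Short pure-text samples like these will show more extreme ratios than a full article, which mixes in ASCII punctuation, digits and spaces.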
>Considering that UTF16 is a dangerous file format, (I agree it is ... )

Personally, I don't concede that point. It's harder to use it with tools that assume byte-aligned code units. But there are many tools which are happy to work with 16-bit code units.

>I dont think any convention that requires you to have read "the
>Beginning" will consistently work with text ...
>XML suffers with this assumption as well with the XML declaration
>declaring the encoding.
>That only works when you have an entire document to look at. ...

I very much agree with this observation.

--
--Jim DeLaHunt, jdlh@jdlh.com http://blog.jdlh.com/ (http://jdlh.com/)
multilingual websites consultant
157-2906 West Broadway, Vancouver BC V6K 2G8, Canada
Canada mobile +1-604-376-8953