XML-DEV Mailing List Archive
Re: Microsoft FUD on binary XML...
Elliotte Rusty Harold wrote:
> One should keep in mind that Chinese and similar languages are quite
> compressed to start with, far more so than English text is. For
> example, in UTF-8 the English word "tree" takes four bytes. The
> Japanese word for tree takes three bytes. Word for word, Chinese
> documents tend to be smaller, even in UTF-8.

Sure, but the point is that anyone who says "Look at how much size reduction we can get with our binary/compression system!" (i.e., on documents with significant text portions) should be shouted at: "Your figures are for ASCII data and markup, please come back when you have figures that also demonstrate the characteristics for non-Latin data and non-Latin markup."

Similarly, we should largely ignore all benchmarks which do not include at least 50% of document data in non-Latin scripts. If someone is making a test suite or a sample to allow a benchmark index to be created for comparison purposes, I suggest something like the following mix would be useful:

  25% ASCII text (English, Bahasa, etc.)
  25% Accented Latin (French, German, Polish, etc.)
  25% CJK (including at least 5% Traditional Chinese, 5% Simplified Chinese, 5% Japanese, 5% Korean)
  25% Other, e.g. any mix of Greek, Russian, Indic, Arabic, Hebrew

And where about half of each group of non-ASCII samples uses non-Latin characters in markup.

Just because ideographs are terser than alphabetic letters does not mean that there is any less value to their users in compressing them. UTF-8 has not proved popular in CJK countries AFAIKS because of the 50% size penalty compared to regional encodings (typically three bytes per character instead of two): transmission and storage size is always important.

Non-Latin requirements in general, and CJK requirements in particular, should not be an afterthought for benchmarking, crumbs given to the dogs under the table after we have finished our feast IYKWIM.
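The byte counts above are easy to check. What follows is only a minimal sketch in Python, not part of any benchmark discussed in this thread: the sample words and the helper name `utf8_overhead` are my own illustrative choices, and the regional encodings are the ones shipped as standard Python codecs.

```python
# Illustrative sample words (an assumption, not taken from any test suite),
# each paired with a common regional encoding for its script.
SAMPLES = [
    ("English", "tree", "ascii"),
    ("Japanese", "木", "shift_jis"),
    ("Simplified Chinese", "树", "gb2312"),
    ("Traditional Chinese", "樹", "big5"),
    ("Korean", "나무", "euc_kr"),
]


def utf8_overhead(text, regional):
    """Return (utf8_bytes, regional_bytes, relative_overhead) for one string.

    relative_overhead is the extra size of UTF-8 relative to the regional
    encoding, e.g. 0.5 means UTF-8 is 50% larger.
    """
    utf8_len = len(text.encode("utf-8"))
    regional_len = len(text.encode(regional))
    return utf8_len, regional_len, (utf8_len - regional_len) / regional_len


for language, word, encoding in SAMPLES:
    utf8_len, regional_len, overhead = utf8_overhead(word, encoding)
    print(f"{language:20} UTF-8: {utf8_len} bytes, "
          f"{encoding}: {regional_len} bytes, overhead: {overhead:.0%}")
```

For these particular words, each CJK character takes three bytes in UTF-8 and two in its regional encoding, which is where the 50% figure comes from. Real documents mix in ASCII markup and punctuation, so the overall penalty for a whole file will usually be lower than for the text alone.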
I am sure that no-one thinks that way, but the issue deserves to be raised: people always assume that the particular issues they face are universal. I have not finished reading all the papers from the W3C meeting, but I have not seen any mention of this issue so far. Maybe everyone is just shipping around numbers?

My recollection from some conference is that writers (of Traditional Chinese and Japanese) rarely have more than a 3000-character vocabulary, even if just because people write about a topic, so there are many words that won't appear in the same discourse as others: "crepuscular" probably doesn't appear in any military tank manual (pedants get googling now!), nor "kangaroo" in books on US quasi-extraterritorial military legal practice (though maybe it should).

Cheers
Rick Jelliffe