RE: XML Binary and Compression
At 9:50 am -0500 11/3/03, Elliotte Rusty Harold wrote:

>At 11:54 PM -0500 3/10/03, winkowski@m... wrote:
>>Rusty,
>>The conclusions drawn were explicitly caveated by the fact that only one
>>document type was tested. The binary message set was generated based on a
>>standard which was designed for narrow-bandwidth transmission. I agree that
>>this paper would not have passed a peer-reviewed journal, as the dataset is
>>not generally available, and your skepticism is justified. However, I don't
>>know what you mean by the statement "we can't tell whether the data set used
>>to produce these results is similar to the sorts of XML data we're working
>>with or not", since XML documents in the wild exhibit a wide range of
>>characteristics (flat, deep, structured, unstructured).
>
>It's not that complex. I have my documents that I'm interested in. You have
>yours. Walter Perry has his. Robin Berjon has his. They are similar in some
>respects and dissimilar in others. Your results may be applicable to my
>needs (or Walter's, or Robin's, or other people's), or they may not,
>depending on how closely the formats you're measuring map to the documents
>we use. However, since we can't look at your documents, there's no way for
>us to tell. We simply don't know whether your results are meaningful in our
>environment or not.

You could try reproducing the experiment with your own data :-) That's what I did.

I can't release my test data either, because it contains personal and financial information that cannot easily be sanitised without destroying the validity of the results. (I'm just interested in the maximum compression available from readily available tools for bulk, structurally repetitive data in a real high-volume application.)

For the record, I used a 1.3 MB file of structurally repetitive, but otherwise variable, "real world" XML data. Each repetition occupies roughly 1.1 KB and is moderately structured (elements nested maybe three to four deep in places), with tag names chosen to be readable rather than terse. The bulk of the data is monetary values (expressed to two decimal places) and personal ID info (names, DoBs, ID numbers).

Gzip -9 (i.e. best compression) reduces the dataset to 5.3% of its original size. XMill -9 reduces the dataset to 3.48% of its original size. The ability to get roughly 50% more data into a given bandwidth is not to be sneezed at, especially given an initial starting point of a near 20-fold reduction in bandwidth requirements.

I know this probably doesn't help you with your own data any more than the original paper did, but I trust someone may find this small endorsement of the XMill approach useful...

-- 
Andy Greener                          Mob: +44 7836 331933
GID Ltd, Reading, UK                  Tel: +44 118 956 1248
andy@g...                             Fax: +44 118 958 9005
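For anyone wanting to reproduce a comparable gzip measurement on their own data, a minimal sketch follows. It uses Python's standard gzip module at compresslevel=9 (equivalent to `gzip -9`); the sample record and its field values are invented for illustration, not taken from the dataset described above, and XMill has no comparable standard-library binding, so only the gzip side is shown.

```python
import gzip

def gzip_ratio(data: bytes) -> float:
    """Return the gzip -9 compressed size as a percentage of the original."""
    compressed = gzip.compress(data, compresslevel=9)
    return 100.0 * len(compressed) / len(data)

# Synthetic stand-in for "structurally repetitive" XML: one readable-tagged
# record repeated many times. Real data would vary per record, so real-world
# ratios will be higher (less favourable) than on this fully repetitive sample.
record = (b"<record><name>Jane Example</name><dob>1970-01-01</dob>"
          b"<amount>1234.56</amount></record>")
data = b"<records>" + record * 1000 + b"</records>"

print(f"gzip -9: {gzip_ratio(data):.2f}% of original size")
```

Because every record here is byte-identical, gzip's 32 KB window catches the repetition and the ratio comes out far better than the 5.3% reported above; substituting your own file for `data` gives the figure that actually matters for your environment.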