
RE: XML Binary and Compression

At 9:50 am -0500 11/3/03, Elliotte Rusty Harold wrote:
>At 11:54 PM -0500 3/10/03, winkowski@m... wrote:
>>The conclusions drawn were explicitly caveated by the fact that only one
>>document  type was tested. The binary message set was generated based on a
>>standard which was designed for narrow bandwidth transmission. I agree that
>>this paper would not have passed a peer reviewed journal as the dataset is
>>not generally available and your skepticism is justified. However, I don't
>>know what you mean by the statement "we can't tell whether the data set used
>>to produce these results is similar to the sorts of XML data we're working
>>with or not" since XML documents in the wild exhibit a wide range of
>>characteristics (flat, deep, structured, unstructured).
>It's not that complex. I have the documents that I'm interested in. You have yours. Walter Perry has his. Robin Berjon has his. They are similar in some respects and dissimilar in others. Your results may be applicable to my needs (or Walter's, or Robin's, or other people's), or they may not, depending on how closely the formats you're measuring map to the documents we use. However, since we can't look at your documents, there's no way for us to tell. We simply don't know whether your results are meaningful in our environment or not.

You could try reproducing the experiment with your own data :-) That's
what I did. I can't release my test data either, because it contains
personal information and financial information that cannot easily be sanitised
without destroying the validity of the results (I'm just interested in the
maximum compression available from readily available tools for bulk,
structurally repetitive data in a real high volume application).

For the record, I used a 1.3 MB file of structurally repetitive, but otherwise
variable "real world" XML data. Each repetition occupies roughly 1.1 KB and
is moderately structured (elements nested maybe three to four deep in places)
with tag names chosen to be readable rather than terse. The bulk of the data
is monetary values (expressed to two decimal places) and personal ID info
(names, DoBs, ID numbers).

Gzip -9 (i.e. best compression) reduces the dataset to 5.3% of its original
size. XMill -9 reduces the dataset to 3.48% of its original size. The ability
to get roughly 50% more data into a given bandwidth is not to be sneezed at,
especially given an initial starting point of a near 20-fold reduction in
bandwidth requirements.
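For anyone wanting to reproduce the gzip half of this on their own data, the measurement is straightforward. The sketch below uses Python's standard-library gzip at compression level 9 (the equivalent of `gzip -9`) on a made-up, structurally repetitive XML document loosely modelled on the description above; the record structure and field names are purely illustrative, not the actual dataset, and your percentages will differ with your data:

```python
import gzip

# Hypothetical stand-in for the (unreleasable) test data: a structurally
# repetitive XML document of monetary values and personal id info.
record = (
    '<transaction>'
    '<name>Jane Example</name>'
    '<dob>1970-01-01</dob>'
    '<account><id>12345678</id>'
    '<amount currency="GBP">1234.56</amount></account>'
    '</transaction>'
)
document = '<transactions>' + record * 1000 + '</transactions>'
raw = document.encode('utf-8')

# compresslevel=9 matches the command-line `gzip -9` used in the post.
compressed = gzip.compress(raw, compresslevel=9)
ratio = len(compressed) / len(raw) * 100
print(f'gzip -9: compressed to {ratio:.2f}% of original size')
```

Note that a purely repeated record like this one compresses far better than real data with varying field values, so treat the printed percentage as a lower bound rather than a prediction. Swapping in XMill for the second measurement means shelling out to the `xmill` binary, since there is no standard Python binding.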

I know this probably doesn't help you with your own data any more than
the original paper did, but I trust someone may find this small endorsement
of the Xmill approach useful...

Andy Greener                         Mob: +44 7836 331933
GID Ltd, Reading, UK                 Tel: +44 118 956 1248
andy@g...                       Fax: +44 118 958 9005

