FW: XML Binary and Compression
Adding this complete reply to the list for the record. - Dan

-----Original Message-----
From: Winkowski, Daniel
Sent: Monday, March 10, 2003 11:55 PM
To: Elliotte Rusty Harold; msc@m...; xml-dev@l...
Cc: winkowski@m...; msc@m...
Subject: RE: XML Binary and Compression

Rusty,

The conclusions drawn were explicitly caveated by the fact that only one document type was tested. The binary message set was generated from a standard that was designed for narrow-bandwidth transmission. I agree that this paper would not have passed a peer-reviewed journal, as the dataset is not generally available, and your skepticism is justified. However, I don't know what you mean by the statement "we can't tell whether the data set used to produce these results is similar to the sorts of XML data we're working with or not," since XML documents in the wild exhibit a wide range of characteristics (flat, deep, structured, unstructured).

The military has been building binary messages optimized for size efficiency for decades. Our group has been working over the past several years to express a variety of messages, some based on binary specifications and some on delimited ASCII, in XML. In all cases the XML versions of these messages are larger than the original binary or ASCII. Is this surprising? I don't think so: metadata is not transmitted in either binary or delimited ASCII formats. You state that the claim that binary files are smaller than the equivalent XML is decidedly untrue in your experience. Quite frankly, this surprises me; our own experience is just the opposite.

Finally, regarding your interest in test set A: the same XML element naming practice was used in both sets. What really distinguishes set A from set B is that A was created before the complete decoded binary sets were available. In other words, set A was an approximation of the binary represented in XML. Consequently, set A could not be compared against the binary samples.
When the complete binary decoding was found to differ from the XML representation used in set A, the ASN.1 tests had already been conducted and unfortunately could not be repeated. So set A cannot form the basis for a binary comparison; it was used instead to compare the various encoding/compression techniques against one another.

On reflection, I don't think the conclusions reached are all that surprising. Redundancy-based compression achieves better results as the file size, and consequently the amount of redundancy, increases. Codecs that take advantage of schema knowledge achieve efficient localized encodings and also need not transmit metadata, since this information can be derived at decoding time. Matching XML documents to the appropriate algorithm can yield optimizations that rival native binary messages.

There is no one-size-fits-all XML compression/encoding algorithm. Optimization requirements vary (speed, memory, document types, streamed decoding or navigability, etc.). However, just as gzip is an 80% solution for text, I hope our study may point to an 80% solution for XML by matching the available data (XML document, document characteristics, XML schema if any, and user requirements) with a matched algorithm. I urge others to follow up on this study with their own experiments. All the techniques we used (gzip, ASN.1, XMill, MPEG-7) are openly available, with the exception of our WBXML-like (XML Schema aware) algorithm.

- Dan Winkowski

PS: FYI, included below is a snippet of an XML instance document with the tags obfuscated but of the same length. The point being that the element names are not abbreviated down to two- or three-letter codes.
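[A quick, hedged illustration of the two effects described above, using made-up tag names rather than the withheld study data. Part 1 shows that gzip's redundancy-based compression improves as the input grows; part 2 sketches the general idea behind a schema-aware encoding (in the spirit of, but not identical to, the WBXML-like algorithm mentioned): when sender and receiver share the tag vocabulary, verbose element names never need to be transmitted.]

```python
import gzip
import re

# Hypothetical record with made-up tag names (the study's real data is
# withheld); it only illustrates the two effects, not the actual format.
record = ("<trackReport><latitude>36.879941</latitude>"
          "<longitude>245.041988</longitude></trackReport>")

# 1) Redundancy-based compression improves with size: gzip N copies of the
#    same record and watch the compressed/original ratio fall as N grows.
for n in (1, 10, 100):
    raw = (record * n).encode()
    ratio = len(gzip.compress(raw)) / len(raw)
    print(f"{n:3d} records: compressed/original ratio {ratio:.2f}")

# 2) Schema-aware tokenization: both sides derive the tag list from the
#    shared schema, so tag names need not be sent at all. Each tag becomes
#    a one-byte token; the high bit marks an end tag.
tags = ["trackReport", "latitude", "longitude"]   # known from the schema
token = {t: i for i, t in enumerate(tags)}

def tokenize(xml: str) -> bytes:
    out = bytearray()
    for close, name, text in re.findall(r"<(/?)([^>]+)>|([^<]+)", xml):
        if name:
            out.append(token[name] | (0x80 if close else 0))
        else:
            out.extend(text.encode())
    return bytes(out)

compact = tokenize(record)
print(f"{len(record)} bytes of XML -> {len(compact)} bytes before compression")
```

Note that the tokenized form is smaller before any compression is applied, which is why a schema-aware codec can rival hand-built binary formats even on small messages, where gzip-style redundancy compression has little to work with.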
<bbb53>
  <cccccccccccc54>fe471f81e65b800</cccccccccccc54>
  <ddddddddd55>0</ddddddddd55>
  <eeeeeee56>0</eeeeeee56>
  <ffffffff57>0</ffffffff57>
  <gggggggggg58>0</gggggggggg58>
  <hhhhhhhh59>701599</hhhhhhhh59>
  <iiiiiiiiiiiiiiiii60>36.879941</iiiiiiiiiiiiiiiii60>
  <jjjjjjjjjjjjjjjjj61>245.041988</jjjjjjjjjjjjjjjjj61>
  <kkkkkkkkkkkkkkkkk62>106652</kkkkkkkkkkkkkkkkk62>
  <llllllllll63>0.000000</llllllllll63>
  <mmmmmmmmmm64>1800</mmmmmmmmmm64>
  <nnnnnnnnnnnnnnnnnnnnnn65>ABC</nnnnnnnnnnnnnnnnnnnnnn65>
  <oooooooooo66>357.478638</oooooooooo66>
  <ppppppppppp67>0.000000</ppppppppppp67>
  <qqqqqqqqqqqqqqqqqqq68>36.669177</qqqqqqqqqqqqqqqqqqq68>
  <rrrrrrrrrrrrrrrrrrr69>244.784124</rrrrrrrrrrrrrrrrrrr69>
  <sssssssssssssssssssssss70>5.000000</sssssssssssssssssssssss70>
  <tttttttttttttttttttttttt71>105</tttttttttttttttttttttttt71>
</bbb53>

> -----Original Message-----
> From: Elliotte Rusty Harold [mailto:elharo@m...]
> Sent: Monday, March 10, 2003 10:40 AM
> To: msc@m...; xml-dev@l...
> Cc: winkowski@m...; msc@m...
> Subject: RE: XML Binary and Compression
>
> At 9:27 AM -0500 3/10/03, msc@m... wrote:
> >Rusty,
> >
> >The corresponding paper can be found here:
> >
> >http://www.idealliance.org/papers/xml02/dx_xml02/papers/06-02-04/06-02-04.pd
>
> Thanks. The key point I gather from reading the paper is:
>
>   Because of the sensitive nature of the study data, the
>   element names used in the sample XML data cannot be
>   discussed in this paper. It can be noted, however, that
>   the tag names used were unabbreviated, descriptive
>   terms.
>
>   As mentioned above, the precise structure and content of
>   the samples cannot be presented here. However, the
>   general structure and data types of the XML documents
>   used for the study can be discussed. These are
>   illustrated in Figure 1, below. Although the study data
>   is not available to the reader, this depiction should
>   indicate that the XML sample structure and content is
>   sufficiently rich for the study purposes.
>
> In other words the raw data is not available, so it's impossible for anybody to independently verify these results. Perhaps more importantly, we can't tell whether the data set used to produce these results is similar to the sorts of XML data we're working with or not. We don't know whether these results would likely be reproducible in our own environments.
> --

AND ALSO IN REPLY TO

> -----Original Message-----
> From: Elliotte Rusty Harold [mailto:elharo@m...]
> Sent: Sunday, March 09, 2003 7:06 AM
> To: xml-dev@l...
> Cc: winkowski@m...; msc@m...
> Subject: Re: XML Binary and Compression
>
> >Interesting paper from MITRE
> >
> >http://www.idealliance.org/papers/xml02/slides/winkowski/winkowski.pdf
>
> Interesting, but there's really not enough information in the PowerPoint slides to fairly judge the work. In particular, I'd really want to see the actual data they used. They started with the assumption that typical binary files were necessarily smaller than the equivalent XML, something that is decidedly untrue in my experience.
>
> Test set A was fabricated by the authors, and I suspect they paid a lot more attention to making it small than anybody actually does in practice. Test set B was "derived directly from binary sample data" but they don't seem to ever show you what this binary sample data was or what its XML encoding was.
>
> I look forward to a more complete paper that provides sufficient information to verify and reproduce the results. Will one be published anywhere?