Re: Xqueeze: Compact XML Alternative
On Thursday 06 February 2003 11:58 am, you wrote:
> Alaric B. Snell wrote:
> >> * Competition with compression: xqML in its current format is as
> >>   structured as XML so it too compresses well. In an experiment[1] a
> >>   12 kB HTML document zipped to 2 kB. The (handwritten) BE for the
> >>   same document took 3 kB and when zipped, it took less than 1000
> >>   bytes.
> >
> > Mmmm, it bugs me when people compare gzipped XML with $binary_format. They
> > should compare XML with $binary_format and gzipped XML with gzipped
> > $binary_format. gzipped $binary_format will, in general, be the smallest
> > of them all, and yet faster to read/write than gzipped XML.

> It doesn't necessarily (or even generally) work that way - compact
> binary formats don't generally compress down as well as text, so you end
> up with size(text) > size(binary) > size(compressed-binary) >
> size(compressed-text).

I've always found that compressed binary is smaller than compressed text, as Tahir found. That makes sense logically too: both the binary and text formats have the same CDATA in them, but the binary format has more compact representations of the elements and so on. Of course, one could design binary formats which compress badly, but I've never found that they do by default. You get a better compression *ratio* with text formats, since there's more redundant information to remove, but that doesn't improve the compression of the non-redundant data.

> That seemed to be the case with my XMLS format
> (http://www.sosnoski.com/opensrc/xmls - still on hold, though I hope to
> get back to it soon). One of the oddities of how compression works...

Hmm. gzip works roughly like this:

1) Use a sliding-window algorithm to find repeated substrings and replace them with backreferences, expressed as the number of bytes backwards to copy from and how many bytes to copy. The output of this is a mixture of literal bytes and backreference tokens.
2) Take the stream of literal bytes and represent them with variable-bit-length codes. The exact algorithm is a bit complex, but basically, very common characters like spaces get codes a couple of bits long, while uncommon characters like * might get codes more than a byte long. Note also that the encoding is adaptive - if there are a lot of Xs towards the beginning of the file but a lot of Ys at the end, then Ys will be given shorter and shorter codes as you delve into the Y-zone.

3) Do the same variable-bit-length trick on the backreference distances and lengths.

4) Merge the two streams, using a few tricks to separate backreferences from literal data which I can't remember so won't detail here.

So what do we gain here?

1) If a string is used and then used again later, the repeat can be replaced with a backreference of a couple of bytes (including the effect of variable-bit-length encoding on the backreference, which in a tag-rich XML document will soon give short codes to the typical distance between <X> and </X>).

2) If certain bytes are a lot more common than others, then those bytes will be represented by codes a few bits long - in XML, < > ' " = and space might get represented in three bits each or so.

3) That sliding window used to look up repeated strings is only so big - from memory, I don't think it can be more than 32k in gzip.

How does this compare between binary and text compression, then? Consider:

<invocation>
 <methodName>logTemperatureReadings</methodName>
 <params>
  <int>123</int>
  <int>456</int>
  <int>34234</int>
  <int>2345</int>
  <int>7644</int>
  <int>2342</int>
  <int>63567</int>
  <int>234</int>
  <int>23435..</int>
  ...
 </params>
</invocation>

The adaptive variable-bit-length encoder starts off with an initial dictionary suited roughly to English text, IIRC, so <invocation> will probably be encoded at around 4-5 bits per character. Likewise with <methodName>logTemperatureReadings</, but then it replaces "methodName>" with a backreference of a few bytes.
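The first two stages - backreferences for repeated substrings, plus short codes for common literal bytes - are easy to observe with Python's zlib module, which implements deflate. This is a hypothetical toy document of my own, not Tahir's test data:

```python
import zlib

# Tag-rich XML: after the first occurrence, each "<int>...</int>" wrapper
# is mostly replaced by backreferences; only the digits remain as literal
# bytes, and those come from a small alphabet so they get short codes.
doc = ("<invocation><methodName>logTemperatureReadings</methodName><params>"
       + "".join(f"<int>{n * 37 % 10000}</int>" for n in range(200))
       + "</params></invocation>").encode("ascii")

packed = zlib.compress(doc, 9)
print(len(doc), len(packed))          # packed is a small fraction of doc
assert zlib.decompress(packed) == doc  # and it round-trips losslessly
```

On data like this, almost all of the compressed size is the digit literals; the tag structure all but vanishes into the sliding window.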
<params> comes out at 4-5 bits per char too. Then it gets interesting. We have a repeated sequence of " <int>", each of which will probably soon cost less than a byte of backreference, since it will keep coming up with backreference distances varying only in the length of the integer. Likewise with "</int>\n". So we end up with, say, two bytes of overhead per integer param. Then the numbers - the only literal bytes occurring during the bulk of the file - will come out at about 4 bits each, meaning that for each integer averaging 4 digits long we use up four bytes in the compressed stream (two for the digits, plus the two of tag overhead). The </params>\n</invocation> bit may, if the file is less than 32k long, refer back to "params>" and "invocation>" from before.

A tight binary encoding of this, as generated by PER, would store the parameter list as a variable-length number giving the number of items to expect, then one variable-length number per integer. The VLNs for four-digit numbers will tend to use two bytes each, with reasonable entropy in the bottom byte - meaning that all 256 values are about equally likely - and the top byte typically only covering a restricted range; the full range of two bytes (after losing two bits to the variable-length encoding) would be 0..16383, of which the above numbers seem to use only a subset, meaning that the distribution of bytes used in that block of numbers would be slightly shy of uniform, maybe allowing the compressor to squeeze each number into 14 or 15 bits if it's lucky.

Now, that's about half the size of the gzipped XML. In this case, the " <int>" and "</int>" overheads are mostly squashed away by the sliding-window compressor, which has very little to work with in the binary case unless there are redundant repeated sequences in the actual data.
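For anyone who hasn't met variable-length numbers, here's a minimal sketch in Python. This is a generic 7-bits-per-byte varint of my own, not PER's actual bit-level rules (which are more involved), but it shows the same size behaviour:

```python
def encode_vln(n: int) -> bytes:
    """Encode a non-negative integer with 7 payload bits per byte;
    the high bit is set on every byte except the last."""
    if n < 0:
        raise ValueError("non-negative integers only")
    chunks = [n & 0x7F]
    n >>= 7
    while n:
        chunks.append(n & 0x7F)
        n >>= 7
    chunks.reverse()
    return bytes(c | 0x80 for c in chunks[:-1]) + bytes(chunks[-1:])

def decode_vln(data: bytes) -> int:
    """Inverse of encode_vln: accumulate 7 bits per byte."""
    n = 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
    return n

# Values up to 16383 fit in two bytes, matching the 0..16383 range above:
assert len(encode_vln(7644)) == 2
assert decode_vln(encode_vln(63567)) == 63567  # larger values take three
```

The two-byte case is exactly the "losing two bits to the variable-length encoding" situation: 16 bits of output, 14 bits of payload.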
Also, the restriction to the character set 0-9 for the numbers means the encoder can fit each digit into three or four bits; but in the binary case, the encoder gets to work directly against the frequency distribution of the underlying data.

The moral of this story is that binary formats tend to insert less of the redundant information and inefficient encoding that gets between the compressor and the actual information being compressed. If our temperature data happens to have an uneven frequency distribution - mainly small integers, with a few going up above 5,000 - the gzip algorithm will not be able to deduce this from the decimal encodings of those numbers; all it sees are strings of digits. But in binary form, a skewed distribution of integers is accessible to the encoder as a skewed distribution of input bytes, which it can capitalise upon.

> David Mertz has done some research in this area - see his article at
> http://www-106.ibm.com/developerworks/library/x-matters13.html Also see

Ok, he brings up bzip - the bzip algorithm is somewhat weird, but it is better than gzip at compressing repeated strings like element and attribute names. He doesn't seem to know that zip and gzip use the same underlying algorithm, deflate, which I've roughly summarised above. Oh well. The "improvement" in gzip's compression is really just because it has less header information. In all of his examples the compressed 'raw' files are smaller than the compressed XML files. His structured XML approach doesn't gain much on hamlet, no, since hamlet is mainly content text anyway. No surprises there. He's still always encoding variations on the attributes and elements, though; in that Apache weblog example, the bytes-transferred field will still be written in decimal even in the struct form, not as a binary number, so the compressor is still being crippled!
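That crippling effect can be demonstrated directly: serialise the same integers once as decimal text and once as fixed two-byte binary, then compress both with deflate and with bzip2. This is made-up uniform random data, chosen only to make the comparison clear-cut - with 16 bits of real information per value, the decimal wrapping is pure cost to the compressor:

```python
import bz2
import random
import struct
import zlib

random.seed(42)
readings = [random.randrange(65536) for _ in range(5000)]

# Decimal text, space separated (averages nearly 6 bytes per value):
as_text = " ".join(str(n) for n in readings).encode("ascii")
# Little-endian unsigned 16-bit binary (exactly 2 bytes per value):
as_binary = b"".join(struct.pack("<H", n) for n in readings)

for name, blob in (("text", as_text), ("binary", as_binary)):
    print(name, len(blob),
          "gzip:", len(zlib.compress(blob, 9)),
          "bzip2:", len(bz2.compress(blob, 9)))
```

Neither compressor can get the text form below the binary form's compressed size, because the digit strings hide the underlying byte distribution from it.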
> James Cheney's paper "Compressing XML with Multiplexed Hierarchical PPM
> Models" at http://www.cs.cornell.edu/People/jcheney/xmlppm/paper/paper.html

*reads*

Still just working around the encodings of close and start elements and all that, though.

> Try a range of documents and see how the compression works out before
> making any claims.

I did; see your quoted references for more that support my statement :-) Quoting the IBM one:

  1764816  weblog.struct
    59955  weblog.struct.bz2
    66994  weblog.xml.bz2
    56156  weblog.txt.bz2
    76183  weblog.struct.gz
   115524  weblog.xml.gz
    59409  weblog.xmi.bz2
    74439  weblog.xmi.gz

The .xml.bz2 is bigger than the .struct.bz2, showing that the binary-ish format compressed smaller than the XML one. Try a range of documents and see how the compression works out before making any claims :-)

> For compression of XML text bzip2 looks like the best
> choice from what I've seen, so that should probably be the basis for
> comparison.

Yes. It's interesting to see how it varies between different types of XML document, too: text-rich (hamlet), tag-rich (weblog, kind of - it has the same tag structure repeated over and over again; I'm thinking more of things like XSLT here) and data-rich (weblog). IIRC my comparisons involved some large DocBook sources, some XSLT, and a data set that was inherently tabular in nature, like the weblog example.

ABS

--
Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
Heed the path that led me to that place, Yellow desert screen