[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] RE: Textual transmission typically faster?
Eugene Kuznetsov writes: > That shouldn't be surprising -- generally speaking a parse tree is > larger than the text input to the parser. (And java serialization > doesn't win any awards for efficiency in either time or space). > > However, a binary representation of the same data (specified in > ASN.1 and encoded using BER, say) would be much smaller and more > efficient. Consider <foobar>data</foobar> If we are using UTF-8, this will require 21 bytes. In a non-navigable binary format (i.e. for one-pass reading, like XML itself), we would have the "foobar" name in a name table and would end up with something like this: 1 byte: type code for start element 4 bytes: pointer to name "foobar" in table 2 bytes: number of attributes (0) 1 byte: type code for text 4 bytes: data length (4) 4 bytes: data 1 byte: type code for end element Using UTF-8 (again) this gives us 17 bytes, a slight improvement. If the amount of text were increased to only say, 20 bytes, then the ratio would be 37:33 rather than 21:17 (not that exciting at all), and that's pre-compression. If you knew for sure that none of your documents would be very long, you could shorten some of the pointers to 3 or even 2 bytes, but then your format wouldn't be generally usable for XML processing. In a navigable binary format, things become much worse, because the binary format needs left/right sibling and parent/first-child pointers, so we end up with something like this: 1 byte: type code for start element 4 bytes: pointer to name in table 4 bytes: pointer to left sibling 4 bytes: pointer to right sibling 4 bytes: pointer to parent 4 bytes: pointer to first child 2 bytes: number of attributes (0) 1 byte: type code for text 4 bytes: data length (4) 4 bytes: data 1 byte: type code for end element That gives us a total of 33 bytes (assuming no navigation pointers at the end of the element) in binary format vs. 21 bytes in clear text, and that's before compression. There would be significant advantages to such a binary format, though, since you could create a DOM interface that could navigate large documents without loading them completely into memory first. Just don't count on file-system space savings. As for time efficiency, that depends on my much time people put into writing the loaders. James Clark's C-based Expat already parses XML at pretty-much I/O speed, so you'd be hard pressed to beat it. The Java-based XML parsers aren't quite as fast, but they're still pretty impressive (there was a bit of friendly rivalry a couple of years ago over benchmarks, and several of the main XML parsers were optimized almost beyond belief). Even if you could write a one-pass binary-format parser that could run, say, 20% faster than the best XML parsers in the same language, that advantage would make almost no difference to total application time, since, in my experience, actual XML parsing (as opposed to building object trees, DOM or otherwise) accounts for well under 5% of execution time for even the simplest real-world applications. A 20% improvement in parsing time would give you at most a 1% improvement in application execution time. If the document is coming over the net rather than from a local drive, even that small advantage will probably be lost in network latency. In summary, then, there are some good arguments for writing a navigable XML-like binary format to allow large documents to be processed with a DOM without loading the whole document into memory. I can see very little argument for using a binary format simply for space or efficiency. All the best, David -- David Megginson david@m... http://www.megginson.com/
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|