Re: Binary XML == "spawn of the devil" ?
Apologies for restarting this thread. I've just returned from my vacation and am working my way through the e-mail that built up while I was away. Having read this entire thread now, there's one issue I noticed that's been feinted at a couple of times, but nobody seems to have taken it head-on. So please allow me to do that now.

One of the goals of some of the developers pushing binary XML is to speed up parsing: to provide some sort of preparsed format that is quicker to parse than real XML. I am extremely skeptical that this can be achieved in a platform-independent fashion.

Possibly some of the ideas for writing length codes into the data might help, though I doubt they help that much, and I doubt they are robust in the face of data that violates its own length codes. Nonetheless this is at least plausible. (The first sketch below shows what a length-checked read looks like, and where the validation cost comes from.)

However, this is not the primary form of preparsing I've seen in existing schemes. A much more common approach assigns types to the data and then writes each value into the file as a binary quantity that can be copied directly into memory. For example, an integer might be written as a four-byte big-endian int, a floating-point number as an eight-byte IEEE 754 double, and so forth. This might speed things up a little in a few cases. However, it's really only going to help on those platforms where the native types match the binary formats. On platforms whose native binary types differ, it might well be slower than performing string conversions. (The second sketch below contrasts the two decoding paths.)

Unicode decoding is a related issue. It's been suggested that this is a bottleneck in existing parsers, and that directly encoding Unicode characters instead of UTF code points might help. However, since in a binary format you're shipping around bytes, not characters, it's not clear to me how such an encoding would be any more efficient than existing encodings such as UTF-8 and UTF-16. If you just want 32-bit characters, use UTF-32.

Possibly you could gain some speed by slamming bytes directly into the native string or wstring type (UTF-16 for Java, possibly other encodings for other languages). However, as with the numeric types, this would be very closely tied to the specific language. What worked well for Java might not work well for C or Perl, and vice versa. Nonetheless it should be doable: a Java parser that worked directly on UTF-16 code units, without decoding them into characters, could certainly be implemented. Verifying the well-formedness of surrogate pairs might be more expensive, but that's rarely needed in practice. (The third sketch below outlines such a scanner.) I think this could be done fully within the bounds of XML 1.0; I don't see why a new serialization format would be necessary to remove this bottleneck from the process.

In summary, I am very skeptical that any preparsed format which accepts schema-invalid documents is going to offer significant speedups across different platforms and languages. I do not accept as an axiom that binary formats are naturally faster to parse than text formats. Possibly this can be proved by experiment, but I tend to doubt it.
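For concreteness, here is a minimal sketch in Java of what a length-checked read might look like. The record layout (a four-byte big-endian length code followed by that many UTF-8 bytes) and all the names are invented for illustration; they are not taken from any actual binary-XML proposal. Note that the check needs an authoritative count of the bytes remaining, which has to come from somewhere.

import java.io.DataInputStream;
import java.io.IOException;

public class LengthPrefixedReader {

    // Reads one record: a four-byte big-endian length code followed
    // by that many UTF-8 bytes. The length code is validated rather
    // than trusted, which is part of the per-record cost.
    static String readRecord(DataInputStream in, int bytesRemaining)
            throws IOException {
        int length = in.readInt();
        if (length < 0 || length > bytesRemaining) {
            throw new IOException("Length code " + length
                + " exceeds the " + bytesRemaining + " bytes remaining");
        }
        byte[] data = new byte[length];
        in.readFully(data); // throws EOFException if the stream lies anyway
        return new String(data, "UTF-8");
    }
}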
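Here, likewise, is a small self-contained comparison of the two decoding paths for the integer 1234567. Java happens to be the best case for the binary path, because DataInputStream.readInt() is defined to read big-endian; a little-endian C program handed the same four bytes would have to swap them first. The class name is mine.

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class IntDecoding {
    public static void main(String[] args) throws IOException {
        // Binary path: four big-endian bytes copied into an int.
        byte[] binary = { 0x00, 0x12, (byte) 0xD6, (byte) 0x87 };
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(binary));
        int fromBinary = in.readInt();

        // Text path: the string conversion the binary format is
        // supposed to save us.
        int fromText = Integer.parseInt("1234567");

        System.out.println(fromBinary + " == " + fromText);
    }
}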
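Finally, a sketch of the UTF-16 approach: a scanner that works directly on code units and never pairs surrogates up, plus the separate well-formedness check a parser could skip when it isn't needed. The isNameChar test is deliberately simplified; the real XML 1.0 name productions enumerate many more character ranges.

public class Utf16Scanner {

    // Deliberately simplified stand-in for the XML 1.0 name rules.
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c)
            || c == '_' || c == '-' || c == '.' || c == ':';
    }

    // Scans a name starting at pos, operating on raw UTF-16 code
    // units; no code points are ever assembled.
    static int scanName(char[] buf, int pos) {
        while (pos < buf.length && isNameChar(buf[pos])) {
            pos++;
        }
        return pos;
    }

    // The optional extra pass: verifying that surrogates pair up.
    // This is the cost that is rarely worth paying in practice.
    static boolean surrogatesWellFormed(char[] buf) {
        for (int i = 0; i < buf.length; i++) {
            if (Character.isHighSurrogate(buf[i])) {
                if (i + 1 >= buf.length
                        || !Character.isLowSurrogate(buf[i + 1])) {
                    return false;
                }
                i++; // skip the low half of a valid pair
            } else if (Character.isLowSurrogate(buf[i])) {
                return false; // low surrogate with no preceding high
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] buf = "name123>".toCharArray();
        System.out.println("Name ends at index " + scanName(buf, 0));
    }
}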
--
Elliotte Rusty Harold
elharo@m...
Processing XML with Java (Addison-Wesley, 2002)
http://www.cafeconleche.org/books/xmljava
http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA