Re: The Rising Sun: How XML Binary Restored the Fortunes of In
This has been interesting. A few points that I think are important:

1: What operations do you want to do?

Whether binary formats "are faster" depends on this more than anything else. If you want to "generate the DSIG digital signature", it's hard to imagine that a binary format could ever be faster, because the DSIG is generally defined on the text stream -- so to calculate it correctly, you'd have to crank through your binary structure and re-create the text stream on the fly. That costs more time than simply scanning a text stream that's already there.

On the other hand, if you want to "skip to the next section", a halfway decent binary format will clean up. That's because it will have a pointer and get there in one step, while the parser has to go looking. Even if the looking is really quick, it's slower than not looking at all.

If you want to examine the XPath "ancestor" axis, resolve IDREFs, and so on, you'll be best off with a binary thing that knows all that already because it figured it out when it was first generated. If you want "preceding", all bets are off unless the implementor specifically designed for it. If you want all occurrences of "the", binary and text will be about the same, and you'd need a full-text index to improve on either.

This reminds me to mention that the claim someone made, that it is impossible to say which is faster because implementations differ, is oversimplified. Of course an incompetent implementation of either type can manage to be immeasurably bad. But some algorithms are inherently faster than others, and binary representations have a larger choice of algorithms.

2: Where is the data kept?

Often the biggest speed factor of all is what data is in RAM vs. on disk vs. over the net. RAM is about 10,000 times faster than disk (disk seeks are very expensive). Binary formats, historically, are intended to overcome this obstacle.

Jeff Vogel and I wrote the first binary SGML implementation that could handle large documents, in early 1990.
Our staff tweaked its parser for XML later. Because we didn't have to touch the binary representation at all, it would not be stretching much to say we had a binary XML representation up and running in 1990. For the usual operations required to search and render documents, nothing I've seen yet has been faster for big documents. But it is uncommon these days that single XML documents are too big to be kept in RAM.

The company was Electronic Book Technologies, the product was DynaText, and it was mainly used for *really* big documents, like F-16 manuals that on paper would outweigh the plane. Typical size for a *single* document was 10-250MB. You could open a document that big, go to any place determined by an XPath-like expression, render, and have the text on the screen in about 1 second. If you want an interesting contrast, make yourself a 1MB HTML file, open it in a browser, scroll to the bottom, and then resize horizontally.

On the other hand, it was purely a delivery system, and you couldn't update in place, although the binary format used theoretically could have supported it. There are at least 11 patents on it, so anyone can go see one way to design a binary XML format that fast (though some of the cooler tweaks were post-patent). One expects that any committees involved would do that. Perhaps after 15 years they could do something substantially better -- but we'll see.

3: What does "lossless" mean?

A few other recent postings have mentioned this issue. I think most people would consider a format "lossless" if you could export from it back into XML syntax, and when you parsed the resulting XML you got the same DOM as for the original document. If that's enough, it's not hard to make a lossless binary format (and mine was lossless, except I think it discarded comments and PIs).

HOWEVER, this is not completely lossless.
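
To make the round-trip definition in point 3 concrete, here is a minimal sketch. It assumes Python and its standard xml.etree.ElementTree parser as a stand-in for a DOM; neither appears in the original message, and the sample documents and the helper same_tree are hypothetical. The two serializations differ in attribute order, attribute quoting, and empty-element syntax, yet they parse to the same tree.

```python
import xml.etree.ElementTree as ET


def same_tree(a, b):
    """Hypothetical helper: compare two parsed elements recursively."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    # Treat missing text and empty text as the same thing.
    if (a.text or "") != (b.text or "") or (a.tail or "") != (b.tail or ""):
        return False
    if len(a) != len(b):
        return False
    return all(same_tree(x, y) for x, y in zip(a, b))


# Same content, different surface syntax (made-up sample documents).
original = """<doc a="1" b='2'><br /><p>text</p></doc>"""
variant = """<doc b="2" a="1"><br></br><p>text</p></doc>"""

trees_match = same_tree(ET.fromstring(original), ET.fromstring(variant))
texts_match = original == variant

print("parsed trees equal:", trees_match)
print("character streams equal:", texts_match)
```

The parsed trees compare equal while the character streams do not, so a format that only promises the same parsed tree on export cannot be expected to reproduce the original characters.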
You still lose (among other things):

* the entity structure
* being able to get a matching DSIG
* all sorts of really ugly whitespace normalization details (including within tags)
* single- versus double-quoting of attributes
* namespace prefix usage
* order of attributes
* <br /> versus <br></br>

So, until you define "lossless", there's no point in comparing whether two products are lossless or not.

Transportability also poses problems. For example, if you mean to move from one system to another, you have to worry about any binary numbers you store -- some systems store the high-order bytes first, some store them last. We insisted on making our binaries readable across platforms, and that involves a lot of byte-swapping overhead that XML parsers never have to mess with.

4: Hybrid solutions

If you only need to optimize certain operations, you can do it within XML: make a pass over the file and add attributes as needed. In the right setting, this could be really fast (though it's harder than it looks):

    <sec b:next-sibling-offset='99999' prev-sibling-offset='241'...>

Also, if you mainly need to optimize resolving IDREFs, just make a separate index that says where they are, and leave the XML as is. XLink works nicely for this.

What I'm saying overall is that the solution space is much wider than it may appear, and the answers are more complex. Also, that it can be, and has been, done successfully. But except for really huge documents, I don't think it's usually worth the effort.

Steve

--
Luthien Consulting: Real solutions to hard information management problems
Specializing in XML, schema design, XSLT, and project design/review/repair
Steven J. DeRose, Ph.D., sderose@a...
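
On the byte-order point under transportability in the message above, here is a minimal sketch. It assumes Python's standard struct module, which is not something the original system used, and it borrows 99999 from the example attribute purely as a sample value. It shows why a stored binary number has to declare its byte order.

```python
import struct

# Sample value only; borrowed from the offset attribute in the message.
offset = 99999

# The same 32-bit unsigned integer written in the two byte orders.
big = struct.pack(">I", offset)     # high-order byte first
little = struct.pack("<I", offset)  # low-order byte first

# Reading bytes with the wrong byte order gives a different number,
# with no error raised.
misread = struct.unpack("<I", big)[0]

# Fixing one order in the format makes the value portable: any reader
# that asks for big-endian gets the original number back.
roundtrip = struct.unpack(">I", big)[0]

print("big-endian bytes:   ", big.hex())
print("little-endian bytes:", little.hex())
print("round trip ok:", roundtrip == offset)
print("wrong-order read ok:", misread == offset)
```

A format that fixes one byte order, as in this sketch, pays a conversion cost on every platform whose native order differs, which is the byte-swapping overhead the message mentions.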