RE: No XML Binaries? Buy Hardware
Some readers of this thread may be interested in the paper my research group published last year at XML 2006, titled "XML Screamer: An Integrated Approach to High Performance XML Parsing, Validation and Deserialization" (full paper available online at [1]). It makes the case that there are certain factors that should be considered *quantitatively* when making statements about which technologies are fast or slow. For example, if CPU time is the primary concern (as opposed to, say, size and/or transmission time), then you really need to ask yourself questions like: how many CPU instructions per input byte is the implementation I have in mind executing, and is that number in some sense reasonable?

What we found in the case of XML was that a lot of people were running around making statements like "regular text-based XML is too slow for application X". Then you'd ask them, "what parser are you using, and with what API?" They might answer: Xerces with SAX feeding some Web Services deserializer. Well, when you look at what a processor like that is doing, the answer is that it's executing hundreds of instructions, on average, per input byte. Ask why, and you find that some of that overhead is inherent in what seem to us the best possible approaches (e.g., it seems essential to do at least some form of comparison on each byte of input if you are to check well-formedness), but much of the overhead comes from things like doing UTF-8 to UTF-16 conversion of tags, many of which are just string-compared again (in their long UTF-16 form!) after SAX hands them up to the application or deserializer. With better APIs, you can extract the necessary information from text XML much, much faster. On the other hand, for other applications you may really need SAX or DOM. The point is to measure both binary and text against the particular applications of interest, using APIs representative of what you'd deploy, to optimize each in that context.
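To make the convert-then-compare waste concrete, here is a minimal sketch (my own illustration, not code from the XML Screamer paper): a parser that checks a tag name against the raw UTF-8 bytes of the input does one comparison, while a SAX-style pipeline typically decodes the name to a wider form first and then string-compares it anyway. The document and function names are invented for this example.

```python
# Hypothetical input document; positions are byte offsets into it.
doc = b'<order><orderId>42</orderId></order>'

def tag_matches_raw(buf: bytes, pos: int, expected: bytes) -> bool:
    """Compare the tag name at buf[pos] directly against expected UTF-8 bytes."""
    end = pos + len(expected)
    # One slice compare, plus a check that the name actually ends here.
    return buf[pos:end] == expected and buf[end:end + 1] in (b'>', b' ', b'/')

def tag_matches_decoded(buf: bytes, pos: int, expected: str) -> bool:
    """The wasteful route: decode the name to a string, then compare."""
    end = buf.index(b'>', pos)
    name = buf[pos:end].decode('utf-8')   # an extra pass over every byte
    return name == expected

# Both answer the same question; only the second pays for a decode.
assert tag_matches_raw(doc, 1, b'order')
assert tag_matches_decoded(doc, 1, 'order')
```

Both calls return the same answer; the difference is only in how many times each input byte gets touched, which is exactly the instructions-per-byte question.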
The point of the above example is that it was not just XML itself causing the overhead: it was XML along with the particular choice of APIs and processing layers, perhaps in some cases aggravated by implementations that just weren't as careful as they might have been. Indeed, one of the main reasons that Expat is faster (though still not as fast as we managed to go in our experiments) is that it passes strings around in the native encoding of the input document.

Also: I've got nothing against SAX. It's fine as a standard for interoperation at medium speed. The fact that it's so much faster than most DOMs has led to the misapprehension that it's not a performance bottleneck relative to what XML can do. In many contexts, it is a bottleneck.

Am I saying binary XML is a bad idea? Not at all, though I've said that I'm unconvinced that standardizing a single binary form of XML is the right thing to do. I am saying that there's a lot of misinformation out there about what really leads to good or bad performance, either for regular text XML or for particular binary flavors. You can take the "best" (by whatever metric) binary XML in the world, force your application through a sub-optimal API, and your performance may be limited; you can easily obscure the true differences between the approaches.

Actually, I believe that a careful, quantitative analysis will show that particular binary forms are indeed much faster for certain applications, especially if the APIs are tuned right. There's no question that, for example, checking end tags is slower than not having to check end tags. The fact that alignments in XML are variable tends to slow things relative to formats in which counts are sent as naturally aligned integers (especially if you luck out and sender and receiver agree on byte order). That's because almost every modern processor is much faster at loading an aligned number than at working through unaligned characters.
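The aligned-integer point can be sketched like this (my own illustration; the wire layouts are invented and not taken from any binary XML specification): the same count sent as a fixed-size binary integer is recovered with one fixed-width load, while the decimal-text form forces the receiver to loop over every digit.

```python
import struct

count = 1234567

# Binary form: a 4-byte little-endian integer at a known, aligned offset.
wire_binary = struct.pack('<I', count)
decoded_binary = struct.unpack_from('<I', wire_binary, 0)[0]  # one fixed-size load

# Text form: variable-length decimal digits; the parser walks byte by byte.
wire_text = str(count).encode('ascii')
decoded_text = 0
for b in wire_text:                 # one compare/multiply/add per digit
    decoded_text = decoded_text * 10 + (b - 0x30)

assert decoded_binary == decoded_text == count
```

In Python both paths are of course slow; the sketch only shows the shape of the work. On real hardware the first path is a single aligned load, and the byte-order caveat above is the `<` in the format string: if sender and receiver disagree, even the binary path pays for a byte swap.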
It has to do with how the memory and cache hierarchies are built. It's also true that binary formats, even those that aren't schema-aware, tend to be able to use string pools and string handles: comparing integer handles is almost always much faster than doing string compares. The sad fact is that most XML implementations, and for that matter many binary XML implementations, are so sub-optimal at this point that those factors are being hidden by other unnecessary overhead. The resulting comparisons between XML and binary are noisy at best.

Now, whether the true extra overhead of text XML is really significant after you finish optimizing it well is a different question. Deploying a good binary XML implementation onto lots of platforms will take lots of work. Tuning XML implementations super-well will take lots of work. When you're done, I do believe the binary will be somewhat, occasionally dramatically, faster for many purposes. Whether the difference will be significant, given the overhead in the rest of the application, given particular choices of API, etc., will depend on your application. I think the answer will be "yes" in selected important applications, and "no" in many others.

The main point of this note is to suggest that these questions need to be considered quantitatively, and with the sort of low-level tests and benchmarks that allow you to account for the instructions your processor is executing. I'm somewhat tired of hearing about Java implementations of XML (or binary) that are slow, but for which nobody can say whether the JIT is doing a good job of inlining. In such cases, you don't know whether you're measuring XML or a deficient Java optimizer. You may not know whether your JIT is doing the same job on both technologies, because optimizers are notoriously sensitive to details of particular applications.
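As an aside on the string-pool point above, here is a minimal sketch of the idea (the class and names are invented for illustration, not from any implementation): each distinct name gets a small integer handle on first sight, so subsequent equality checks are integer compares rather than byte-by-byte string compares.

```python
# Minimal string pool: bytes -> small integer handle.
class StringPool:
    def __init__(self) -> None:
        self._handles: dict = {}

    def intern(self, name: bytes) -> int:
        """Return the handle for name, assigning a fresh one on first sight."""
        handle = self._handles.get(name)
        if handle is None:
            handle = len(self._handles)
            self._handles[name] = handle
        return handle

pool = StringPool()
ORDER_ID = pool.intern(b'orderId')        # interned once, up front

# In the hot path: compare handles, not strings.
seen = pool.intern(b'orderId')
assert seen == ORDER_ID                   # integer compare
assert pool.intern(b'customer') != ORDER_ID
```

The real win in a binary format is that the sender can transmit the handle itself, so the receiver never even hashes the bytes; here the intern call still pays that cost once per occurrence.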
To really know what you've got, you have to get into the running code and see what machine code the JIT has produced. (We actually did that, but we found it to be such a pain that we publicly reported mainly our C-language results, for which checking the machine code is much easier.)

Anyway, I hope the paper is of interest. We had fun doing the work.

Noah

[1] http://www2006.org/programme/item.php?id=5011

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------