RE: XML Performance in a Transaction
--- Michael Kay <mike@s...> wrote:
> > My expectation is that XML parsing can be
> > significantly sped up with ...
>
> I think that UTF-8 decoding is often the bottleneck
> and the obvious way to speed that up is to write the
> whole thing in assembler. I suspect the only

I think this depends heavily on the content, and on the definition of
bottleneck: for western European (ASCII/ISO-Latin-1) subsets, the
difference I have observed is 15-20%, between equivalent (7-bit ASCII)
content declared as UTF-8 vs. ISO-8859-1. Since ISO-8859-1 decoding is
trivially easy, the highest possible speed-up would be in the 15-20%
range (except if decoding and parsing were tightly coupled -- an
option I am planning to explore in the future). I would expect the
overhead to be more significant for content with a high ratio of
non-ASCII characters, however. (A sketch of why the Latin-1 case is so
cheap is at the end of this message.)

> way of getting a significant improvement (i.e. more
> than a doubling) in parser speed is to get closer to
> the hardware. I'm surprised no-one has done it.
> Perhaps no-one knows how to write assembler any more
> (or perhaps, like me, they just don't enjoy it).

I think the big reason is that the pay-off just does not seem THAT
high. For C/C++, hand-coded assembly seldom yields a particularly good
return (on commodity hardware); and even going to native code from
things like Java is just an incremental improvement (if any), with
associated drawbacks. Besides, writing a truly compliant XML parser is
tedious (and extensive) work. ;-)

Writing specialized parsers for subsets (as in the case of what SOAP
requires) is easier; yet the performance boosts hopeful coders promise
seem elusive when one compares apples to apples.

The problem with XML parsing by hardware is that it Just Does Not Pay
Off: if you get, say, a 20% boost (usually sacrificing full XML
compatibility as well), but pay 20%+ overhead on memory transfer from
the card to main memory (after all, I/O is the major overhead
component of parsing nowadays), there is little point in going through
the trouble. And this is exactly what happened with at least one
vendor (according to comments by an engineer who worked with one of
these companies: they started looking into more lucrative areas once
their "xml accelerator" lost any boost at the Linux driver level).

The thing is: performance improvements for XML will need to be found
above the tokenization/low-level parsing level. There is very little
left to gain at the raw parser level: raw throughputs already sit
between 100 and 1000 Mbps switched Ethernet speeds (the earlier 40
MBps rate equals a 400 Mbps Ethernet bit stream -- close to or above
practical maximum transfer rates over gigabit Ethernet), and yet at
the higher processing level people talk about 20 tps for SOAP (just
one of the figures I recently saw attributed to Axis 1.x; with 4k
messages and replies, that is ~0.16 MBps). It is clear that the
problems lie somewhere between the application code and the parser.
(The arithmetic is spelled out below.)

Pure parsing performance does not degrade with megabyte-sized input.
I have no problem parsing my 500-megabyte product description data
dump and processing it entry by entry (result set size grows
logarithmically or less); a streaming sketch of that approach is below
as well. Doing a full in-memory general-purpose transformation does
degrade that way, however, for obvious (memory locality) and perhaps
other reasons.
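To make the Latin-1 point concrete, here is a minimal decoding sketch
(mine, for illustration only -- not code from any actual parser):
ISO-8859-1 maps every byte 1:1 to the same Unicode code point, whereas
UTF-8 has to branch on lead bytes and reassemble multi-byte sequences.

    // Illustrative decoding loops; method names are made up, and the
    // UTF-8 version omits 4-byte sequences and all error handling.

    // ISO-8859-1: each byte IS the code point; one mask, no branches.
    static int decodeLatin1(byte[] in, char[] out) {
        for (int i = 0; i < in.length; ++i) {
            out[i] = (char) (in[i] & 0xFF); // mask: Java bytes are signed
        }
        return in.length;
    }

    // UTF-8: branch on the lead byte, then shift and merge the
    // continuation bytes into a single char.
    static int decodeUtf8(byte[] in, char[] out) {
        int o = 0;
        for (int i = 0; i < in.length; ) {
            int b = in[i++] & 0xFF;
            if (b < 0x80) {            // 1-byte (ASCII) sequence: hot path
                out[o++] = (char) b;
            } else if (b < 0xE0) {     // 2-byte sequence
                out[o++] = (char) (((b & 0x1F) << 6) | (in[i++] & 0x3F));
            } else {                   // 3-byte sequence (BMP only)
                out[o++] = (char) (((b & 0x0F) << 12)
                        | ((in[i++] & 0x3F) << 6)
                        | (in[i++] & 0x3F));
            }
        }
        return o;
    }

For pure 7-bit ASCII content the UTF-8 loop only ever takes the first
branch, so the extra cost is roughly one comparison per byte -- which
fits the modest 15-20% difference I measured.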
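And to spell out the back-of-the-envelope arithmetic: the 4 KB request
plus 4 KB reply split is my assumption, and the MBps-to-Mbps
conversion assumes the usual ~10 bits per byte of wire-level framing
(which is how 40 MBps maps to 400 Mbps).

    // Back-of-the-envelope comparison; message sizes are assumed.
    public class SoapThroughput {
        public static void main(String[] args) {
            double parserMBps = 40.0;            // raw parser throughput
            double parserMbps = parserMBps * 10; // ~10 bits/byte w/ framing
            double tps = 20.0;                   // rate attributed to Axis 1.x
            double bytesPerTx = 4096 + 4096;     // 4 KB message + 4 KB reply
            double soapMBps = tps * bytesPerTx / (1024 * 1024);

            System.out.printf("parser:     %.0f MBps (~%.0f Mbps on the wire)%n",
                    parserMBps, parserMbps);
            System.out.printf("SOAP stack: %.2f MBps, a factor of %.0f lower%n",
                    soapMBps, parserMBps / soapMBps);
        }
    }

That is a gap of more than two orders of magnitude (256x here), which
is why I claim the problem is not in the tokenizer.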
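Finally, a minimal sketch of the entry-by-entry processing I mean,
written against the StAX pull API (the file and element names are
hypothetical; any streaming parser would do). The point is that only
the current entry is ever held in memory, so throughput stays flat no
matter how big the document is:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StreamingDump {
        public static void main(String[] args) throws Exception {
            XMLInputFactory f = XMLInputFactory.newInstance();
            // "products.xml" stands in for the 500 MB dump; name made up
            XMLStreamReader r = f.createXMLStreamReader(
                    new FileInputStream("products.xml"));
            int entries = 0;
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "product".equals(r.getLocalName())) { // made up
                    entries++; // a real app would process the entry here
                }
            }
            r.close();
            System.out.println("entries seen: " + entries);
        }
    }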
-+ Tatu +-