[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: XML processing experiments
First off, thanks to James for a some very thought-provoking work. At 07:03 PM 04/11/97 +0700, James Clark wrote: >If all you want to do is be able to >correctly parse well-formed XML, and you don't care about detecting >whether or not it is well-formed, how much code does it take and is it >significantly faster than using an XML parser ... >Lark: 10.5 seconds .. MSXML: 24 .. nsgmlsu: 8 .. sgcount:11 .. >xmlec (C): 0.5 seconds .. (Java): 1.5 seconds. [BTW, when I got Lark to run "almost as fast as SP", I decided that was qualitatively fast enough for now]. >I was quite surprised that there was such a big performance difference No kidding. Discussions here are a bit dangerous, since in the Java domain, we are kind of operating in the dark; we don't have profiling tools with really good granularity. This is my excuse for engaging in performance analysis based on intuition, something for which I have personally fried more than one junior programmer. Let's look at James' code eating up a "-quoted literal, where characters are in the byte array buf[], start and end being integer indices therein: case (byte)'"': { for (++start; start != end; ++start) { if (buf[start] == (byte)'"') { nextTokenIndex = start + 1; return TOK_LITERAL; } The following are candidates for why a program like Lark or MSXML might run slower. - works with Java char rather than byte variables - does a method dispatch (or at least a few conditionals) per character processed for at least two reasons: to manage the entity stack, and to have a place to put the different character encoding processing modules. [Note: A look at James' code makes me wonder if this is *really* as necessary as I thought] - does quite a bit more work upon recognizing some markup constructs; in particular for a start tag it pulls appart the attribute list and packages up the element type & attributes in a nice structure convenient for an API user I went and looked at Lark's main loop, and for a 'typical' character processing mode, i.e. it's not the begin or end of a tag or attribute or something and no buffers run out but the text is being saved, it ends up executing 25 lines of Java including one getXmlCharacter() method dispatch; none of them are monster conditionals or anything. James' code above, in the equivalent case, is executing 3 I think. so while lines-of-code is very shaky yardstick indeed, the difference is 8 or 9 to 1, which is not out of line with the observed performance difference. My intuition is that what's holding Lark back is (a) the per-char dispatching, and (b) turning the DFA crank, which requires a 2D array reference, then a shift & mask I have some ideas on how to fix both, but first I have to make Lark do conditional sections and validate (neither should slow it down significantly). One other experiment would be useful, that might shed light from a different angle. James, how about doing element counts per type; i.e. actually *using* some of the info come back from the tokenizer, nothing fancy, just use a java.util.Hashtable or some such; should be able to run very similar code on Lark and your TokenStream thing; I wonder if it would change the numbers. I'll get around to this sometime if nobody else does, but not for the next 2-3 weeks. -Tim xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i... Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To (un)subscribe, mailto:majordomo@i... the following message; (un)subscribe xml-dev To subscribe to the digests, mailto:majordomo@i... the following message; subscribe xml-dev-digest List coordinator, Henry Rzepa (mailto:rzepa@i...)
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|