Re: JITTs and DOM
Hi Patrick,

>>I'd be *very* careful about drawing any conclusions about speed up
>>from these observations. What you've done for these observations is
>>replace markup-significant characters (e.g. '<') with
>>markup-insignificant characters (i.e. '@'), effectively turning
>>whole regions of the document into plain text.
>
> I said in my post that these were observations that suggest further
> investigation. The replacement was noted on the webpage as
> simulating the result of a JITTs parser. Yes, the operation of a
> JITTs parser would be to treat regions of the document into plain
> text. Sorry if that was not explicit in our earlier treatments of
> JITTs parsing.

I understood that the *output* would be plain text, but I thought that the *input* would be marked-up text. This wasn't the case in the samples that you were using for your observations. I did see that you characterised them as "observations" and said that you would do more investigation; I just didn't want you or anyone else to get too hopeful about a 30x speedup on the basis of these particular observations.

>>It wouldn't be enough to just ignore all the tags that the parser
>>came across (which is what you've done in effect). Instead, the
>>parser would have to read the tag, look at the name of the tag,
>>check that against a list (from a DTD or schema) in order to work
>>out what to do, and then either generate a "start/endElement" event
>>or generate a "characters" event (to report the tag as a string)
>>depending on the tag's status. If anything, I imagine that this will
>>*add* time to the parsing of the document.
>
> Parsers already build a tree from the DTD or schema in order to
> "recognize" the markup it encounters in the document. All JITTs
> would require is in the lookup step, where a parser now looks for
> the token in the tree is that upon failure, the parser starts
> reading input again. (That assumes you are using the suggested
> ignore option, with delete, it would drop the token from the input
> string and continue reading input.)

Absolutely. I think I wasn't clear -- I was describing the extra work that a parser would have to do on top of the "scanning plain text until you come to a '<'" parsing that your observations were demonstrating, not the extra work on top of XML parsing.

For what it's worth, the lookup step is not hard to implement as a SAX filter on top of an existing XML parser (that's basically what I did when I implemented basic filtering from LMNL documents into XML). As Rick pointed out, filtering-by-namespace is a very easy place to start and wins you a lot immediately, but one of the things that I think we're both interested in is filtering-by-schema/DTD, which is harder but more powerful and interesting.

The other thing that I think is promising about the JITTs approach is the ability to parse just the bits of the document that you're interested in, on the fly, during processing. A DOM implementation that did this behind the scenes could be very effective. (I'm sure that native XML databases / content management systems do this kind of thing all the time; I don't know if any in-memory DOM implementations do, or if it's been tried and for some reason rejected?)

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/
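
The lookup step described in the message above (read the tag, check its name against a list, then emit either element events or a "characters" event) can be sketched as a SAX filter. This is only an illustration using Python's standard `xml.sax` module, not the filter the message refers to and not a JITTs parser: the names `KnownElementFilter`, `EventRecorder`, `classify_events` and `known_names` are hypothetical, `known_names` stands in for a list that would really come from a DTD or schema, and attributes on unrecognised tags are dropped for brevity. Because it sits on top of a full XML parse, it says nothing about parsing speed.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class KnownElementFilter(XMLFilterBase):
    """Pass recognised elements through as markup events; report any
    other tag to the downstream handler as plain character data."""

    def __init__(self, parent, known_names):
        super().__init__(parent)
        # Hypothetical stand-in for names taken from a DTD or schema.
        self.known_names = set(known_names)

    def startElement(self, name, attrs):
        if name in self.known_names:
            super().startElement(name, attrs)
        else:
            # Unrecognised tag: report it as a string (attributes omitted).
            self.characters("<%s>" % name)

    def endElement(self, name):
        if name in self.known_names:
            super().endElement(name)
        else:
            self.characters("</%s>" % name)


class EventRecorder(ContentHandler):
    """Collect the events that reach the application, for inspection."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("chars", content))


def classify_events(xml_text, known_names):
    """Parse xml_text through the filter and return the recorded events."""
    recorder = EventRecorder()
    flt = KnownElementFilter(xml.sax.make_parser(), known_names)
    flt.setContentHandler(recorder)
    flt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return recorder.events


if __name__ == "__main__":
    sample = "<doc><p>one <x>two</x></p></doc>"
    for event in classify_events(sample, {"doc", "p"}):
        print(event)
```

With `doc` and `p` as the recognised names, the `x` tags in the sample reach the recorder as character data rather than as start/end element events, while `doc` and `p` are reported as elements in the usual way.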