
Re: JITTs and DOM


Hi Patrick,

>>I'd be *very* careful about drawing any conclusions about speed up
>>from these observations. What you've done for these observations is
>>replace markup-significant characters (e.g. '<') with
>>markup-insignificant characters (i.e. '@'), effectively turning
>>whole regions of the document into plain text.
>
> I said in my post that these were observations that suggest further
> investigation. The replacement was noted on the webpage as
> simulating the result of a JITTs parser. Yes, the operation of a
> JITTs parser would be to treat regions of the document into plain
> text. Sorry if that was not explicit in our earlier treatments of
> JITTs parsing.

I understood that the *output* would be plain text, but I thought that
the *input* would be marked-up text. This wasn't the case in the
samples that you were using for your observations. I did see that you
characterised them as "observations" and said that you would do more
investigation; I just didn't want you or anyone else to get too
hopeful about a 30x speedup on the basis of these particular
observations.

>>It wouldn't be enough to just ignore all the tags that the parser
>>came across (which is what you've done in effect). Instead, the
>>parser would have to read the tag, look at the name of the tag,
>>check that against a list (from a DTD or schema) in order to work
>>out what to do, and then either generate a "start/endElement" event
>>or generate a "characters" event (to report the tag as a string)
>>depending on the tag's status. If anything, I imagine that this will
>>*add* time to the parsing of the document.
>
> Parsers already build a tree from the DTD or schema in order to
> "recognize" the markup it encounters in the document. All JITTs
> would require is in the lookup step, where a parser now looks for
> the token in the tree is that upon failure, the parser starts
> reading input again. (That assumes you are using the suggested
> ignore option; with delete, it would drop the token from the input
> string and continue reading input.)

Absolutely. I think I wasn't clear -- I was describing the extra work
that a parser would have to do on top of the "scanning plain text
until you come to a '<'" parsing that your observations were
demonstrating, not the extra work on top of XML parsing.

For what it's worth, the lookup step is not hard to implement as a SAX
filter on top of an existing XML parser (that's basically what I did
when I implemented basic filtering from LMNL documents into XML). As
Rick pointed out, filtering-by-namespace is a very easy place to start
and wins you a lot immediately, but one of the things that I think
we're both interested in is filtering-by-schema/DTD, which is harder
but more powerful and interesting.
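
A minimal sketch of such a filter, assuming Python's xml.sax rather
than whatever SAX implementation was actually used. The KNOWN set is a
hypothetical stand-in for the element names a DTD or schema would
supply, and JittsFilter and filter_document are illustrative names
only. Unknown tags are reported through "characters" events (the
"ignore" option); known tags produce the usual start/endElement
events.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

# Hypothetical stand-in for the names a DTD or schema would declare.
KNOWN = {"doc", "p"}


class JittsFilter(XMLFilterBase):
    """Pass known elements through; report unknown tags as text."""

    def __init__(self, parent, known):
        super().__init__(parent)
        self.known = known

    def startElement(self, name, attrs):
        if name in self.known:
            super().startElement(name, attrs)
        else:
            # The lookup failed: hand the tag on as character data.
            attr_text = "".join(f' {k}="{v}"' for k, v in attrs.items())
            super().characters(f"<{name}{attr_text}>")

    def endElement(self, name):
        if name in self.known:
            super().endElement(name)
        else:
            super().characters(f"</{name}>")


def filter_document(xml_text, known=KNOWN):
    """Run xml_text through the filter and return the serialised result."""
    out = io.StringIO()
    flt = JittsFilter(xml.sax.make_parser(), known)
    # XMLGenerator escapes '<' and '>' in character data on output.
    flt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    flt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


result = filter_document("<doc><p>Hi <b>there</b></p></doc>")
```

Note that the underlying parser still tokenises every tag before the
filter sees it, so this shows the lookup logic only and says nothing
about any speedup.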

The other thing that I think is promising about the JITTs approach is
the ability to parse just the bits of the document that you're
interested in, on the fly, during processing. A DOM implementation
that did this behind the scenes could be very effective. (I'm sure
that native XML databases / content management systems do this kind of
thing all the time; I don't know if any in-memory DOM implementations
do, or if it's been tried and for some reason rejected?)
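
One way to picture the on-demand idea, as a rough sketch rather than a
description of any existing DOM implementation: keep each region as
raw markup and only build a tree for it the first time it is asked
for. LazyFragment and the sections mapping are hypothetical names, and
xml.dom.minidom is used purely for convenience.

```python
from xml.dom import minidom


class LazyFragment:
    """Holds raw markup and parses it into a DOM only on first access."""

    def __init__(self, raw):
        self.raw = raw
        self._dom = None
        self.parse_count = 0  # how many times this fragment was parsed

    @property
    def element(self):
        if self._dom is None:
            # Deferred work: the tree is built here, not at load time.
            self._dom = minidom.parseString(self.raw)
            self.parse_count += 1
        return self._dom.documentElement


# Hypothetical document split into regions that are still plain text.
sections = {
    "intro": LazyFragment("<section><title>Intro</title></section>"),
    "body": LazyFragment("<section><title>Body</title></section>"),
}

# Only the "intro" region is parsed; "body" stays as an unparsed string.
title = sections["intro"].element.getElementsByTagName("title")[0]
title_text = title.firstChild.data
```

A real implementation would also need a cheap way to find region
boundaries without fully parsing them, which this sketch sidesteps by
starting from pre-split strings.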

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/

