[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

Re: I have implemented SAX based XPath Engine

  • From: Philippe Poulard <philippe.poulard@s...>
  • To: Michael Kay <mike@s...>
  • Date: Fri, 20 Feb 2009 18:14:08 +0100

Re:  I have implemented SAX based XPath Engine
Michael Kay a écrit :
>> I have implemented few years ago such a tool, the global 
>> design is described here:
>> http://reflex.gforge.inria.fr/saxPatterns.html
> 
> Thanks, I had seen this article before but it was worth rereading.
> 
> I'm puzzled by the concept of keeping exactly four counters. For an
> expression like
> 
>   child::p[@a=17][5]

you're right, and grouping is not supported either:

this one should work:
/foo/*/*[last()]
but this one doesn't :(
(/foo/*/*)[last()]

> you need a counter of how many nodes have satisfied the @a=17 predicate. I
> would have thought it was better to generalize this so that a counter is
> always relative to a predicate, for example p[17] is treated as
> child::node()[self::p][17] and the counter is relative to the predicate
> [self::p].

great idea !

I will correct my documentation about those unsupported cases

> Doing this using threads does look like a recipe for adding performance
> overheads. (Saxon's streaming sometimes uses two threads, but never more). 

I use 2 threads for each filter, and in a pipeline with n filters, there 
will be n+1 threads

I
> would have thought the great advantage of using a push model is that you can
> push the same events to multiple destinations without the use of multiple
> threads. Saxon does this when validating instance documents against XML
> Schema identity constraints, where any number of constraints can be active
> at any time. I agree with your comment, however, that the ideal would be to
> compile some kind of finite state machine. That would work, of course,
> regardless of whether the evaluation is in push or pull mode.
> 
> The XSL WG has been doing work on streaming over the last year or so (hence
> my interest). There's nothing available to publish yet, but we've made some
> good progress and have built up quite a substantial collection of use cases.
> We've been trying to define constructs that are amenable to "pure streaming"
> without any buffering or lookahead, but it's difficult to draw the
> boundaries.

Very interesting, I look forward to read all that !

Best regards,
Philippe

> 
> Michael Kay
> http://www.saxonica.com/
> 
>> It is designed to filter SAX streams with XPath-based 
>> patterns, but has also useful filters to process text 
>> (non-XML) inputs:
>> http://reflex.gforge.inria.fr/tutorial-pipelinesAndFilters.html
>>
>> XPath is rather well supported: position() and last() are 
>> supported but it doesn't support preceding:: and 
>> preceding-sibling:: axis except in very few circumstances, 
>> but I also propose a workaround when filtering a SAX stream 
>> (juggling with a local DOM subtree when necessary).
>>
>> To answer to Michael about how predicates are evaluated when 
>> reading forward is required, the engine uses a lookahead 
>> buffer and goes on reading until the actual predicate becomes 
>> evaluable; for that purpose, as explained in the article, the 
>> engine uses coroutines that are implemented using threads 
>> (one to evaluate the predicate, the other to hold the 
>> position in the call stack of the current startElement() 
>> event); I think that a finite state machine based on a pull 
>> parser would be much more efficient: although the stuff works 
>> somewhat well, I have noticed that it runs slowly when I use 
>> lots of XPath patterns in a pipeline made of lots of filters, 
>> and it can be an issue when reading GB of XML. I also know 
>> that things here and there have to be optimized, for example 
>> instead of evaluating the entire set of XPath patterns on 
>> each event, I could recognize that a subset is irrelevant for 
>> a given branch and I should discard them in that branch (but 
>> currently it doesn't work like that); there are also things 
>> to do better about partial evaluation specifically when 
>> comparison operators are involved, for example [count(foo)>9] 
>> should exit when 10 <foo>s are met rather than when the 
>> 1000000 specimen are read. I have imagined a strategy where 
>> the count() function should return something different than a 
>> number, a numeric object evaluable several times by the 
>> operator that could fetch more data on demand, until the 
>> NumberThatIsAtLeast object reach (or not) the expected value. 
>> Lots of work in sigth.
>>
>> Of course I will have a look at Santhosh's work :)
>>
>> --
>> Cordialement,
>>
>>                ///
>>               (. .)
>>   --------ooO--(_)--Ooo--------
>> |      Philippe Poulard       |
>>   -----------------------------
>>   http://reflex.gforge.inria.fr/
>>         Have the RefleX !
> 


-- 
Cordialement,

               ///
              (. .)
  --------ooO--(_)--Ooo--------
|      Philippe Poulard       |
  -----------------------------
  http://reflex.gforge.inria.fr/
        Have the RefleX !


[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]


PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
 

Stylus Studio has published XML-DEV in RSS and ATOM formats, enabling users to easily subcribe to the list from their preferred news reader application.


Stylus Studio Sponsored Links are added links designed to provide related and additional information to the visitors of this website. they were not included by the author in the initial post. To view the content without the Sponsor Links please click here.

Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.