
Re: Disk-based XPath Processing

  • From: Philippe Poulard <Philippe.Poulard@s...>
  • To: Uche Ogbuji <uche@o...>
  • Date: Mon, 02 Oct 2006 11:47:26 +0200

Uche Ogbuji wrote:
> Tatu Saloranta wrote:
> 
>>Alas, although there is quite a bit of interest, I
>>haven't seen solutions where streaming parsers could
>>use some suitable subset of XPath to match sub-trees
>>(suitable meaning that only some axes were supported,
>>parent/grandparent, attribute, children, but not
>>sibling). I have been hoping to investigate doing this
>>myself in near future, since it would seem to simplify
>>some streaming-oriented tasks (like only building
>>small sub-trees, or one sub-tree at a time from a
>>bigger document).
> 
> 
> What you describe in the above para is pretty much exactly what Amara's
> pushbind and pushdom allow, and the trimxml tool that John L. Clark
> mentions, exposes this approach on the command line.  They use a subset
> of XSLT patterns (which are themselves a subset of XPath, as defined
> in the XSLT 1.0 spec) to drive a streamable operation that only loads
> into memory one subtree at a time from a larger document.  I think it
> does still need a little baking, but I've been successful using it for
> some pretty heavy-duty work.
> 

Hi,

A few months ago I worked on XPath filtering on SAX streams; it
supports XPath patterns with predicates, forward axes, etc., like these:

a[@b]
a[not(@b)]
a[@b='c']
a[@b='c']/d[@e]
/a/b/c[1]
a/*[2]
a/comment()[3]
a/node()[position() < 4]
/a/b/c[last()]
a/*[count() > 3]
a/node()[last()]
a[following-sibling::b]
a[b]
a[*[not(self::b)]]
id("foo")
id("foo")/child::para[position()=5]/a/b/c[last()]

But you should be aware that:
- When parsing, if you use an expression that requires reading the
whole tree, the whole tree will be cached, and you should use DOM
instead. That is to say, if you ask for silly things, you will get
them: with a really huge XML file, avoid such expressions, otherwise
you will get an OutOfMemory error.
- Once a node has been discarded, you can't reach it again: reverse axes
(except the ancestor axes) are not available. The only thing you can do
is anticipate, by storing part of the tree in a DOM fragment, working
with it, and then discarding it.
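
To make these two points concrete, here is a minimal sketch in Python
(not the RefleX code, which is in Java). It assumes a pattern limited to
a plain chain of child steps such as a/d. The class name
SubtreeCollector and its parameters are hypothetical. The stack of open
element names is what keeps the ancestor axes available, and each
matched subtree is buffered into a fragment, handed over, and then
dropped.

```python
import xml.sax
import xml.etree.ElementTree as ET


class SubtreeCollector(xml.sax.ContentHandler):
    """Buffers one subtree at a time for elements whose path ends with `steps`."""

    def __init__(self, steps, on_match):
        super().__init__()
        self.steps = list(steps)   # e.g. ["a", "d"] for the pattern a/d
        self.on_match = on_match   # called with each completed fragment
        self.ancestors = []        # names of the currently open elements
        self.builder = None        # a TreeBuilder while a subtree is buffered
        self.depth = 0             # nesting depth inside the buffered subtree

    def startElement(self, name, attrs):
        self.ancestors.append(name)
        # Only the open elements are known here, never discarded siblings.
        if self.builder is None and self.ancestors[-len(self.steps):] == self.steps:
            self.builder = ET.TreeBuilder()
        if self.builder is not None:
            self.depth += 1
            self.builder.start(name, dict(attrs.items()))

    def characters(self, content):
        if self.builder is not None:
            self.builder.data(content)

    def endElement(self, name):
        self.ancestors.pop()
        if self.builder is not None:
            self.builder.end(name)
            self.depth -= 1
            if self.depth == 0:
                fragment = self.builder.close()
                self.builder = None  # the handler keeps no reference to it
                self.on_match(fragment)


doc = b"<root><a><d e='1'>x</d></a><a><d>y</d><z/></a><d>skip</d></root>"
found = []
xml.sax.parseString(doc, SubtreeCollector(["a", "d"], found.append))
texts = [el.text for el in found]
```

Here the callback simply appends the fragments to a list so that they
can be inspected. On a really huge file the callback would process each
fragment and let it go, so that only one subtree is in memory at a time.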

The technique used is described at
http://reflex.gforge.inria.fr/saxPatterns.html
(this is a preview)

The implementation is in Java and is part of the RefleX engine
(http://reflex.gforge.inria.fr/). Unfortunately, I haven't yet
published the latest release with all that stuff; however, you can
browse the SVN repository if you are (very very very) curious:
https://gforge.inria.fr/plugins/scmsvn/viewcvs.php/root/src/java/org/inria/reflex/xml/filter/?rev=104&root=reflex
https://gforge.inria.fr/plugins/scmsvn/viewcvs.php/root/src/java/org/inria/reflex/xml/sax/?root=reflex

The upcoming version of RefleX will supply a set of tags for filtering
SAX streams with XPath patterns. Here are common use cases that XSLT
users should find easy to understand:

<xcl:filter xmlns:xcl="http://www.inria.fr/xml/active-tags/xcl">

     <!-- copy -->
     <xcl:rule pattern="copy">
         <xcl:forward>
             <xcl:apply-rules/>
         </xcl:forward>
     </xcl:rule>

     <!--delete the element and its content-->
     <xcl:rule pattern="deleteElem"/>

     <!-- ignore an element, but apply rules on its content -->
     <xcl:rule pattern="ignoreElem">
         <xcl:forward>
             <insertedBefore/>
         </xcl:forward>
         <xcl:apply-rules/>
         <xcl:forward>
             <insertedAfter/>
         </xcl:forward>
     </xcl:rule>

     <!--insert a container-->
     <xcl:rule pattern="content">
         <xcl:forward>
             <insertedContainer>
                 <xcl:apply-rules/>
             </insertedContainer>
         </xcl:forward>
     </xcl:rule>

     <!--remove an attribute-->
     <xcl:rule pattern="removeAttr">
         <xcl:remove parent="{ . }" referent="{ @bar }"/>
         <xcl:forward>
             <xcl:apply-rules/>
         </xcl:forward>
     </xcl:rule>

     <!--remove all attributes-->
     <xcl:rule pattern="removeAllAttr">
         <xcl:remove parent="{ . }" referent="{ @* }"/>
         <xcl:forward>
             <xcl:apply-rules/>
         </xcl:forward>
     </xcl:rule>

     <!--change the value of an attribute-->
     <xcl:rule pattern="changeAttr">
         <xcl:attribute referent="{ . }" name="foo" value="foo"/>
         <xcl:forward>
             <xcl:apply-rules/>
         </xcl:forward>
     </xcl:rule>

</xcl:filter>

A filter reads one or several inputs in their entirety and can produce
several outputs. Unlike XSLT, an XCL filter traverses each input tree in
document order only. More complex processes that require deep structural
transformations are better handled with XSLT. XCL filters are suitable
when the processing is local to independent chunks of data, which is
advantageous for stream processing of large inputs, although XCL filters
can also be convenient for automatically traversing a DOM tree. By
combining other active tags with the small set defined here, it is still
possible to build efficient pipeline processes.
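
For readers who want to see the single-pass idea outside of XCL, here is
a rough analogue of the "delete the element and its content" rule,
sketched in Python with the standard xml.sax filter classes. It is not
how RefleX implements it; the names DropElements and drop are
hypothetical, and matching is by element name only rather than by XPath
pattern.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropElements(XMLFilterBase):
    """Forwards SAX events in document order, minus subtrees rooted at `names`."""

    def __init__(self, parent, names):
        super().__init__(parent)
        self.names = set(names)
        self.skip_depth = 0  # greater than 0 while inside a dropped subtree

    def startElement(self, name, attrs):
        if self.skip_depth or name in self.names:
            self.skip_depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        # Text inside a dropped subtree is discarded as well.
        if not self.skip_depth:
            super().characters(content)


def drop(xml_text, names):
    """Runs the filter over `xml_text` and returns the serialized result."""
    out = io.StringIO()
    flt = DropElements(xml.sax.make_parser(), names)
    flt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    flt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


result = drop(
    "<r><keep>1</keep><deleteElem><x>2</x></deleteElem><keep>3</keep></r>",
    ["deleteElem"],
)
```

The filter only keeps a depth counter, so memory use does not grow with
the size of the input. Since it is itself an XML reader, it can also
serve as the parent of another filter.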

Of course, you can combine these basic structures at will, as long as
you use a single <xcl:apply-rules/> element.

Of course, several filters can be connected in a pipeline, including
steps that involve XSLT filtering and XInclude processing.

Of course, this kind of filter will be applicable to SAX streams and
DOM trees, at the user's option.

-- 
Cordialement,

               ///
              (. .)
  --------ooO--(_)--Ooo--------
|      Philippe Poulard       |
  -----------------------------
  http://reflex.gforge.inria.fr/
        Have the RefleX !

