optimization of complex XPath

Subject: optimization of complex XPath
From: Graydon <graydon@xxxxxxxxx>
Date: Thu, 18 Nov 2010 20:20:52 -0500
A bit of background before I get to the question:

I have about 2 GiB of XML that contains links.  These links are
constructed as <link area="value" cite="numvalue"/> and reference num
elements with matching cite values, <num cite="numvalue"/>.  (The area
value is associated with an ancestor of the num element.)  The guarantee
is that an area/cite pair is unique across the entire content set.
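
For concreteness, a minimal sketch of the shape of the data (the
section wrapper element and the cite values here are invented for
illustration):

    <section area="decisions">
      <num cite="1997-0042"/>
      ...
    </section>
    ...
    <link area="decisions" cite="1997-0042"/>  <!-- resolves -->
    <link area="decisions" cite="9999-9999"/>  <!-- broken -->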

The roughly one-third of the content represented by that 2 GiB needs to be checked for broken links.  (The other two-thirds needs to be converted from SGML, and _then_ it will need to be checked for broken links.)

Since the currently converted content is all in the area "decisions", I
can only check links that point _to_ the area "decisions": the internal
links of the already-converted content.  That means I have about 100,000
of the 250,000 links to check on this pass, but I will eventually have
to validate all the links in the whole (5+ GiB) content set.

I am currently using the XPath expression:

for $x in (//link[@area='decisions']/@cite) return
    (if ($x = (//num/@cite))
     then 'good'
     else concat('bad cite value ',$x,'&#x000a;')
    )

to check the links. (I don't need to qualify the num element in
//num/@cite with [ancestor::*[1]/@area='decisions'] because all the
currently converted content has an @area value of "decisions".)
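
In case it helps frame the question, here is the same check with the
target cites hoisted into a variable, as an XQuery variant (let isn't
available in plain XPath 2.0).  I don't know whether this actually
stops the processor from re-evaluating //num/@cite once per link:

let $cites := //num/@cite
for $x in //link[@area='decisions']/@cite
return
    (if ($x = $cites)
     then 'good'
     else concat('bad cite value ',$x,'&#x000a;')
    )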

This works, and dumps out a text list of bad links (and one "good"
per good link, so I can count them), but it is slow in ways that make me
think it's at least O(N^2) and probably worse.  Performance is
reasonable on small subsets of the content but very slow on the whole of
the available XML.  Since I have to check the whole thing as a unit to
properly test the ~two fifths of the links that refer to the decisions
area, and since I will eventually have to run the check on two or three
times as much content, O(N^2) is a problem; I can't say it takes a week
to run the link check.  (Well, I can _say_ it, but it won't meet with a
pleased reception.)

How can I make this check faster?

I can run this as XQuery via eXist, as XPath 2.0 using Saxon 9.2, or
(if really necessary) by some other means, and would appreciate any
suggestions from the list's collective wisdom.

Thanks!
Graydon
