optimization of complex XPath

Subject: optimization of complex XPath
From: Graydon <graydon@xxxxxxxxx>
Date: Thu, 18 Nov 2010 20:20:52 -0500
A bit of background before I get to the question:

I have about 2 GiB of XML that contains links.  These links are
constructed as <link area="value" cite="numvalue"/> and reference num
elements with matching cite values, <num cite="numvalue"/>.  (The area
value is associated with an ancestor of the num element.)  The guarantee
is that an area/cite pair is unique across the entire content set.
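
For concreteness, a minimal sketch of the shape of the data (the
section wrapper element and the cite values here are invented for
illustration):

    <section area="decisions">
      <num cite="1997-0042"/>
      ...
    </section>
    ...
    <link area="decisions" cite="1997-0042"/>  <!-- resolves -->
    <link area="decisions" cite="9999-9999"/>  <!-- broken -->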

The roughly one-third of the content represented by that 2 GiB needs to be checked for broken links.  (The other two-thirds needs to be converted from SGML, and _then_ it will need to be checked for broken links.)

Since the currently converted content is all in the area "decisions", I
can only check links that point _to_ the area "decisions": the internal
links of the already-converted content.  That means I have about 100,000
of the 250,000 links to check on this pass, but I will eventually have
to validate all the links in the whole (5+ GiB) content set.

I am currently using the XPath expression:

for $x in (//link[@area='decisions']/@cite) return
    (if ($x = (//num/@cite))
     then 'good'
     else concat('bad cite value ',$x,'&#x000a;')
    )

to check the links. (I don't need to qualify the num element in
//num/@cite with [ancestor::*[1]/@area='decisions'] because all the
currently converted content has an @area value of "decisions".)
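
In case it helps frame the question, here is the same check with the
target cites hoisted into a variable, as an XQuery variant (let isn't
available in plain XPath 2.0).  I don't know whether this actually
stops the processor from re-evaluating //num/@cite once per link:

let $cites := //num/@cite
for $x in //link[@area='decisions']/@cite
return
    (if ($x = $cites)
     then 'good'
     else concat('bad cite value ',$x,'&#x000a;')
    )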

This works, and dumps out a text list of bad links (and one "good"
per good link, so I can count them), but it is slow in ways that make me
think it's at least O(N^2) and probably worse.  Performance is
reasonable on small subsets of the content but very slow on the whole of
the available XML.  Since I have to check the whole thing as a unit to
properly test the ~two fifths of the links that refer to the decisions
area, and since I will eventually have to run the check on two or three
times as much content, O(N^2) is a problem; I can't say it takes a week
to run the link check.  (Well, I can _say_ it, but it won't meet with a
pleased reception.)

How can I make this check faster?

I can run this as XQuery via eXist, as XPath 2.0 using Saxon 9.2, or
(if really necessary) by some other means, and would appreciate any
suggestions from the list's collective wisdom.

Thanks!
Graydon
