Re: optimization of complex XPath
The O(n^2) is due to iterating over a list (the "for") and then, for each
item, scanning the whole document tree for //num/@cite. The following
stylesheet should do it better, using xsl:key:

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:key name="citekey" match="num" use="@cite"/>

  <xsl:template match="link">
    <xsl:value-of
        select="(key('citekey',@cite),'ok',concat(@cite,' is missing'))[2]"/>
    <xsl:text>
</xsl:text>
  </xsl:template>

</xsl:stylesheet>

The select expression works by sequence construction: when key() finds a
matching num, the sequence is (num, 'ok', message) and item [2] is 'ok';
when it finds nothing, the sequence collapses to ('ok', message) and item
[2] is the missing-cite message. The key is built once, so each lookup is
effectively constant time rather than a scan of the whole tree.

I have tested it only with a very primitive XML:

<doc>
  <foo>
    <num area="decisions" cite="1"/>
    <bar>
      <link area="decisions" cite="1"/>
      <link area="decisions" cite="2"/>
    </bar>
  </foo>
</doc>
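Two caveats before pointing this at the real 2 GiB rather than my toy
input (both untested guesses about your data). First, real content
presumably contains text nodes, and the built-in template rules would
copy them all to the output, so you will want an empty template to
suppress that. Second, once the other areas are converted, the key will
need to take the area into account as well; something like the following,
where the ancestor expression is a guess at where @area actually lives:

  <xsl:template match="text()"/>

  <xsl:key name="citekey" match="num"
           use="concat(ancestor::*[@area][1]/@area, '|', @cite)"/>

with the lookup changed to match:

  key('citekey', concat(@area, '|', @cite))

With Saxon 9 you would run it along the lines of (file names are
placeholders):

  java -jar saxon9he.jar -s:content.xml -xsl:checklinks.xsl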
-W

On 19 November 2010 02:20, Graydon <graydon@xxxxxxxxx> wrote:
>
> A bit of background before I get to the question:
>
> I have about 2 GiB of XML which contains links. These links are
> constructed as <link area="value" cite="numvalue"/>, and reference num
> elements with matching cite values, <num cite="numvalue"/>. (The area
> value is associated with an ancestor of the num element.) The guarantee
> is that an area and cite pair is unique across the entire content set.
>
> The roughly-a-third of the content represented by that 2 GiB needs to
> be checked for broken links. (The other 2/3 needs to be converted from
> SGML, and _then_ it will need to be checked for broken links.)
>
> The currently converted content is all in the area "decisions", so I
> can only check links that point _to_ the area "decisions": the internal
> links of the already-converted content. That means I have about 100,000
> out of the 250,000 links to check on this pass, but I will eventually
> have to validate all the links in the whole (5+ GiB) content set.
>
> I am currently using the XPath expression:
>
> for $x in (//link[@area='decisions']/@cite) return
>   (if ($x = (//num/@cite))
>    then 'good'
>    else concat('bad cite value ',$x,'
> ')
>   )
>
> to check the links. (I don't need to qualify the num element in
> //num/@cite with [ancestor::*[1]/@area='decisions'] because all the
> currently converted content has an @area value of "decisions".)
>
> This works, and dumps out a text list of bad links (and one "good" per
> good link, so I can count them), but it is slow in ways that make me
> think it's at least O(N^2) and probably worse. Performance is
> reasonable on small subsets of the content but very slow on the whole
> of the available XML. Since I have to check the whole thing as a unit
> to properly test the ~two fifths of the links that refer to the
> decisions area, and since I will eventually have to run the check on
> two or three times as much content, O(N^2) is a problem; I can't say it
> takes a week to run the link check. (Well, I can _say_ it, but it won't
> meet with a pleased reception.)
>
> How can I make this check faster?
>
> I can run this as an XQuery via eXist, via XPath 2.0 using Saxon 9.2,
> or (if really necessary) some other means, and would appreciate any
> suggestions from the list's collective wisdom.
>
> Thanks!
> Graydon
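P.S. If you would rather stay in XQuery on eXist than switch to XSLT,
the same idea applies there: compute the set of valid cite values once,
outside the loop, instead of re-scanning //num for every link. A
minimal, untested sketch:

  let $valid := //num/@cite
  for $x in //link[@area = 'decisions']/@cite
  return
    if ($x = $valid)
    then 'good'
    else concat('bad cite value ', $x)

Binding $valid first guarantees the //num scan happens only once;
whether the per-link comparison then becomes an index lookup rather
than a scan of the value list depends on the engine (in eXist, on how
your range indexes are configured).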