Re: efficient traversal of combined collections in XSL

Subject: Re: efficient traversal of combined collections in XSLT 3.0
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Sat, 24 Nov 2012 15:27:24 +0000
The way we do this in maintaining the XSLT/XQuery specs (admittedly much smaller than your 4GB) is to maintain a derived document containing a list of valid link targets. This is regenerated when the base documents change, which is less frequently than the list is used. The list of valid anchors is much smaller than the base documents, so it can be loaded more quickly, and uses less memory.
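
A minimal sketch of how such a derived list might be generated for the content described in the quoted message below. It reuses the element names and the area|cite key expression from the quoted stylesheet; the output vocabulary (anchors, anchor, @key), the production-uri parameter and the main template name are assumptions made up for illustration, and this is not the stylesheet used for the XSLT/XQuery specs.

```xml
<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <!-- Hypothetical parameter: collection URI of the base documents -->
   <xsl:param name="production-uri"
     select="'file:///home/graydon/stages/production/2012-11-13?recurse=yes;select=*.xml;on-error=ignore'"/>
   <xsl:output indent="yes"/>
   <!-- Intended to be run with "main" as the named initial template -->
   <xsl:template name="main">
     <anchors>
       <!-- Visit the base documents one at a time -->
       <xsl:for-each select="collection($production-uri)">
         <!-- Same match and key value as the "places" key in the quoted stylesheet -->
         <xsl:for-each select=".//*[num[@cite]]">
           <anchor key="{concat(ancestor-or-self::*[@area][1]/@area,'|',num[1]/@cite)}"/>
         </xsl:for-each>
       </xsl:for-each>
     </anchors>
   </xsl:template>
</xsl:stylesheet>
```

The result only needs regenerating when the base documents change; duplicate key values are left in, since they do not affect a simple existence check.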

Also, generating the list of anchors is an operation that can be streamed; hopefully the resulting list is small enough that it can be held in memory for look-up purposes.
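
A sketch of the look-up side, assuming the derived list has been saved as anchors.xml with the shape <anchors><anchor key="area|cite"/>...</anchors> and that it covers every collection the links may point into. The file name, the anchors-uri parameter and the anchor vocabulary are hypothetical; the link and target selection and the good/bad output follow the quoted stylesheet below.

```xml
<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <!-- Hypothetical location of the derived list; a relative URI is
        resolved against the stylesheet's base URI -->
   <xsl:param name="anchors-uri" select="'anchors.xml'"/>
   <xsl:variable name="anchors" select="document($anchors-uri)"/>
   <xsl:variable name="content"
     select="collection('file:///home/graydon/stages/APFF?recurse=yes;select=*.xml')"/>
   <!-- The key indexes the small derived list, not the base documents -->
   <xsl:key name="anchor" match="anchor" use="@key"/>
   <xsl:template name="main">
     <bucket>
       <xsl:for-each select="$content//link, $content//target[not(reference-text/link)]">
         <xsl:variable name="wanted" select="concat(@area,'|',@cite)"/>
         <!-- Wrap each link in good or bad depending on whether its anchor exists -->
         <xsl:element name="{if (key('anchor',$wanted,$anchors)) then 'good' else 'bad'}">
           <uri>
             <xsl:sequence select="base-uri(.)"/>
           </uri>
           <xsl:sequence select="."/>
         </xsl:element>
       </xsl:for-each>
     </bucket>
   </xsl:template>
</xsl:stylesheet>
```

With this arrangement only the content being checked and the anchor list need to be in memory at the same time, rather than the full set of base documents.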

Michael Kay
Saxonica

On 24/11/2012 13:53, Graydon wrote:
So I have about 4.0 GB of "production" content, XML that's already in use, can have deliverables generated from it, and which various groups of editors may change.

I also have "content": material (generally about 0.2 to 0.25 GB) that is being converted from SGML and which, before it is added to "production", needs to be checked to see whether the links in it work.

Links use a combination of @area (the name of a scope within which the numbers are unique) and @cite (the number); this is for legislation, so the numbers can get complicated, but the basic scheme is pretty simple. (Targets are one direction in a bi-directional relationship, so a link in a fancy hat; they usually contain links, and we only need to check them if they _don't_ contain a link.)

The slightly tricky bit is that I want to check the links in "content" to see if they match something in "content" _and_ in "production"; XSLT 3.0's version of key() will accept an arbitrary top node, so (using the Saxon 9.4 that ships with the current oXygen, 14.1) I can declare the stylesheet to be version 3.0, combine "production" and "content" into "searchSpace", and define a key on that.

<xsl:stylesheet exclude-result-prefixes="xs" version="3.0"
   xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:variable name="content" select="collection('file:///home/graydon/stages/APFF?recurse=yes;select=*.xml')"/>
   <xsl:variable name="production"
     select="collection('file:///home/graydon/stages/production/2012-11-13?recurse=yes;select=*.xml;on-error=ignore')"/>
   <xsl:variable name="searchSpace" select="($content,$production)"/>
   <xsl:key match="*[num[@cite]]" name="places" use="concat(ancestor-or-self::*[@area][1]/@area,'|',num[1]/@cite)"/>
   <xsl:template match="/">
     <bucket>
       <xsl:for-each select="$content//link,$content//target[not(reference-text/link)]">
         <xsl:choose>
           <xsl:when test="key('places',concat(current()/@area,'|',current()/@cite),$searchSpace)">
             <good>
               <uri>
                 <xsl:sequence select="base-uri(.)"/>
               </uri>
               <xsl:sequence select="."/>
             </good>
           </xsl:when>
           <xsl:otherwise>
             <bad>
               <uri>
                 <xsl:sequence select="base-uri(.)"/>
               </uri>
               <xsl:sequence select="."/>
             </bad>
           </xsl:otherwise>
         </xsl:choose>
       </xsl:for-each>
     </bucket>
   </xsl:template>
</xsl:stylesheet>

This works well on content-sized chunks of input (0.25 GB or so), and I get an answer in about 15 seconds.

It doesn't work on the full data set; 16 GB of RAM isn't enough to do this to 4 GB of data. Various wheels are in motion to get more RAM.

So maybe everything will be fine, but I can't help looking at that code and going "this is a really naive search; there has to be a more efficient way to do this."

So, O XSLT List, what's the more efficient way to do this?

Thanks!

-- Graydon
