[XSL-LIST Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: finding a minimum set of references
On Wed, Jan 09, 2013 at 04:56:44AM -0800, Dimitre Novatchev scripsit: > This can be done in three steps: > > 1. Get the distinct values of concat(@area,'+',@site) for all tree elements. > > 2. Get a sorted sequence of the corresponding `loc` elements that > were produced in step 1. The sort is by the count of `tree` elements > that have that particular loc element as descendant. > > 3. Recursively add to the "minList" the top of the sorted loc > elements that isn't yet in this "minList". Stop condition -- when > there is nothing more to add. This is very similar to what I did, after having woken up at the 3 AM local going "wait! I can do this!" and making rapid notes. (Having learnt what happens should I _not_ make those notes!) (Much thanks to Wolfgang for saying "use keys!" last night.) What I did was to go through each tree element's loc element children, and count how many other such loc elements existed anywhere in the data set. (Keys were very useful for this, and just let me note that trying to stick the result of a key lookup that returns a node in a comment via xsl:sequence during debugging is something I should remember not to do more reliably.) Once I had that count associated with each loc element (@other), I deleted all the loc elements but the one with the highest value for @other. Third step was to make a unique list of the @area/@cite pairs contained in those surviving loc elements, and consider that to represent the minimum spanning set of documents that contain all the element parent/child pattern examples. (There are other steps about "the example of this parent/child pattern is in this document" and fetching those documents and so on, but those aren't in the code sample below, which is already plenty long enough.) > I believe that this would really produce the minimum sequence of > locations -- due to the fact that we pick locations in decreasing > order of trees that reference them. I think what I've got, which I suspect is equivalent in a graph-theoretic sense, also does this. I am also really really glad my current clients aren't going to ask me to _prove_ that. The input set this was tested on had 19,503 documents in it; the minimum spanning set of examples that results has 294 documents in it, so about 1.5% of the total. If that's not actually minimal, it's close enough. Total, end-to-end run time (via Saxon 9.4.something in oXygen), from "make a collection out of this half-GB pile of documents" to "minimum spanning set of documents in a pile" is about 65 seconds. (The bit below, run independently on the big set of tree elements, takes about a second.) I'm really pleased with this result. (Some of the previous attempts were taking more than half an hour of run time to not work....) > I would be willing to produce the code that implements this algorithm, > but I would need to be provided with a complete (but not too long) > source XML document -- so that the result could be easily verified That's extraordinarily kind of you, but I really did just need the algorithm hint. (This time, at least! :) I'm greatly reassured at the similarity of the suggested methodology, too! Thanks, Dimitre! -- Graydon <xsl:stylesheet exclude-result-prefixes="xs xd" version="2.0" xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xd:doc scope="stylesheet"> <xd:desc> <xd:p><xd:b>Created on:</xd:b> Jan 8, 2013</xd:p> <xd:p><xd:b>Author:</xd:b> graydon</xd:p> <xd:p/> </xd:desc> </xd:doc> <xsl:key match="//loc" name="locKey" use="concat(@area,'|',@cite)"/> <xsl:template match="/"> <xsl:variable name="getLocCounts"> <xsl:apply-templates mode="countLoc" select="node()"/> </xsl:variable> <xsl:variable name="findMostOther"> <xsl:apply-templates mode="mostOther" select="$getLocCounts"/> </xsl:variable> <xsl:variable name="makeDocumentList"> <xsl:apply-templates mode="docList" select="$findMostOther"/> </xsl:variable> <xsl:sequence select="$makeDocumentList"/> </xsl:template> <xsl:template match="node()|@*" mode="countLoc mostOther docList prettify"> <xsl:copy> <xsl:apply-templates mode="#current" select="node()|@*"/> </xsl:copy> </xsl:template> <xsl:template match="loc" mode="countLoc"> <xsl:copy> <xsl:attribute name="other" select="count(key('locKey',concat(@area,'|',@cite))[not(. is current())])"/> <xsl:apply-templates mode="countLoc" select="node()|@*"/> </xsl:copy> </xsl:template> <xsl:template match="locations" mode="mostOther"> <xsl:copy> <xsl:for-each-group group-by="@other" select="loc"> <xsl:sort data-type="number" order="descending" select="@other"/> <xsl:if test="position() eq 1"> <xsl:sequence select="current-group()[1]"/> </xsl:if> </xsl:for-each-group> </xsl:copy> </xsl:template> <xsl:template match="container" mode="docList"> <wkna-shared-cms> <assembly area="testpub" xml:lang="en"> <xsl:for-each-group group-by="concat(@area,'|',@cite)" select="tree/locations/loc"> <xsl:sort select="current-grouping-key()"/> <include area="{current-group()[1]/@area}" cite="{current-group()[1]/@cite}"/> </xsl:for-each-group> </assembly> </wkna-shared-cms> </xsl:template> </xsl:stylesheet>
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|