[XSL-LIST Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

RE: Calculating groups of repeating elements

Subject: RE: Calculating groups of repeating elements
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 11 Dec 2008 19:56:00 -0000
RE:  Calculating groups of repeating elements
Given that you have 250 places and 75 words, it seems to me that one could
approach this as follows:

(1) Construct all pairs of words that appear together in at least one place,
and find all the places in which this pair occurs. Discard the groups that
appear in only one place. This gives you all the groups of size 2.

(2) For each group of size 2, and the set of places in which this group
appears, construct candidate groups of size 3 by combining the 2-group with
all the other words appearing in those places. Again, discard the 3-groups
that appear in only one place. This gives you all the 3-groups

(3) Continue recursively to construct the 4-groups, 5-groups, etc

(4) Now add a step (0) before step (1) which constructs the 1-groups; the
expansion to 2-groups uses exactly the same logic.

Now finding the 1-groups looks like this:

<xsl:function name="f:level-one-groups" as="element(group)*">
<for-each-group select="//word" group-by=".">
  <xsl:if test="count(current-group()) gt 1">
    <group>
      <word><xsl:value-of select="current-grouping-key()"/></word>
      <xsl:copy-of select="place_number"/>
    </group>
  </xsl:if>
</xsl:for-each-group> 
</xsl:function>

and finding the level-n groups given the level-n-1-groups is:

<xsl:function name="f:level-n-groups" as="element(group)*">
  <xsl:param name="level-n-1-groups" as="element(group)*"/>
      <xsl:variable name="g" select="$level-n-1-groups"/>
      <xsl:for-each select="$g">
        <xsl:variable name="places" 
                      select="//place[place-number = $g/place_number]"/>
        <xsl:variable name="otherWords" 
                      select="$places/words/word[not(. = $g/word)]"/>
        <xsl:variable name="n-gram" 
                      select="$g/word, ."/>
        <xsl:variable name="places-with-n-gram" 
                      select="$places[every $w in n-gram satisfies
./words/word = $w]"/>
        <xsl:if test="count($places-with-n-gram) gt 1">
          <group>
            <xsl:for-each select="$n-gram">
               <word><xsl:value-of select="."/>
            </xsl:for-each>
            <xsl:for-each select="$places-with-n-gram">
               <word><xsl:value-of select="."/>
            </xsl:for-each>
          </group>
        </xsl:if>
     </xsl:for-each>
</xsl:function>

then you just have to do a recursion in which you start by finding the
level-1 groups, then recurse to levels 2, 3, 4 etc until you find there are
no groups at a particular level.

Not tested, of course, and almost certainly capable of improvement.

Michael Kay
http://www.saxonica.com/
          

> -----Original Message-----
> From: Quinn Dombrowski [mailto:qdombrow@xxxxxxxx] 
> Sent: 10 December 2008 20:16
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject:  Calculating groups of repeating elements
> 
> Hello,
> 
> I'm trying to calculate all of the groups of 2+ elements (in 
> the sample data below, words) that appear together in more 
> than one place. Ideally, I'd like to be able to sort 
> descending both by length of group (5-word group, 4-word 
> groups, etc), and by number of places the groups occur (100 
> places, 99 places, etc.) I also need to be able to list the 
> place numbers where they occur.
> 
> I started doing it manually this way but the number of 
> possible combinations quickly became too big a task:
> 
> <xsl:template match="/">
> <xsl:value-of 
> select="count(atlas/place/place_number[../words/word='Aa']
> intersect atlas/place/place_number[../words/word='C'])"/>
> </template>
> (adding more "intersects" as necessary, and getting rid of 
> the "count" 
> to see the place numbers)
> 
> Here's a sample of the data. Almost every word appears in 
> multiple places, but each appears only once in the index, 
> which I've used in other applications for matching to avoid 
> re-calculating stats for the word over and over. Any help 
> would be wonderful!
> 
> <atlas>
> <place>
> <place_number>1</place_number>
> <words>
> <word>Aa</word>
> <word>C</word>
> <word>Qqq</word>
> </words>
> </place>
> 
> <place>
> <place_number>2</place_number>
> <words>
> <word>Aa</word>
> <word>Bbbb</word>
> <word>C</word>
> <word>W</word>
> <word>Zz</word>
> </words>
> </place>
> 
> <place>
> <place_number>3</place_number>
> <words>
> <word>Aa</word>
> <word>C</word>
> <word>Bb</word>
> <word>Qqq</word>
> <word>Wwww</word>
> <word>Zz</word>
> </words>
> </place>
> 
> [etc]
> 
> <index>
> 
> <index_entry>
> <underlying_word>A</underlying_word>
> <word>A</word>
> <word>Aa</word>
> <word>Aaa</word>
> </index_entry>
> 
> <index_entry>
> <underlying_word>B</underlying_word>
> <word>Bb</word>
> <word>Bbbb</word>
> </index_entry>
> 
> [etc]
> 
> </index>
> </atlas>

Current Thread

PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.