Re: grouping and word counting

Subject: Re: grouping and word counting
From: "Dimitre Novatchev" <dnovatchev@xxxxxxxxx>
Date: Sat, 19 Jul 2003 18:56:04 +0200
Hi Marina,

One can use the string tokeniser from FXSL (the "str-split-to-words"
template) in order to obtain a list of words from a string and then count
them.

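This, combined with the Muenchian method for grouping (applied twice: first to
group identical messages, then to group those by how many times each was
sent), gives us the following solution.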

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:ext="http://exslt.org/common"
 exclude-result-prefixes="ext">

  <!-- FXSL string tokeniser: provides the "str-split-to-words" template -->
  <xsl:import href="strSplit-to-Words.xsl"/>

  <xsl:output method="text"/>

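  <!-- Pass 1 key: groups MESSAGE elements by their text -->
  <xsl:key name="kMsg" match="MESSAGE" use="."/>

  <!-- Pass 2 key: groups the intermediate m elements by occurrence count -->
  <xsl:key name="kByCount" match="m" use="@count"/>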

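  <xsl:template match="/">
    <!-- Pass 1: one m element per distinct message text;
         @count is the number of times that message was sent -->
    <xsl:variable name="vPass1">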
      <xsl:for-each
        select="/*/*/MESSAGE[generate-id()
                            =
                             generate-id(key('kMsg',
                                             .
                                             )[1]
                                         )
                             ]">
         <xsl:sort select="count(key('kMsg',.))"
                   data-type="number"/>
         <m count="{count(key('kMsg',.))}"
            text="{.}"/>
      </xsl:for-each>
    </xsl:variable>

    <!-- Pass 2: visit one m per distinct @count, i.e. one per group size -->
    <xsl:for-each
    select="ext:node-set($vPass1)/m
                   [generate-id()
                   =
                    generate-id(key('kByCount',
                                     @count
                                    )[1]
                                )
                   ]">
      <xsl:sort select="count(key('kByCount', @count))"
           data-type="number"/>

      <!-- Concatenate the texts of all messages in this group -->
      <xsl:variable name="vAllText">
        <xsl:for-each select="key('kByCount', @count)">
          <xsl:value-of select="concat(' ', @text, ' ')"/>
        </xsl:for-each>
      </xsl:variable>

      <!-- Split the combined text into word elements with the FXSL tokeniser -->
      <xsl:variable name="vrtfWords">
        <xsl:call-template name="str-split-to-words">
          <xsl:with-param name="pStr" select="$vAllText"/>
          <xsl:with-param name="pDelimiters" select="' '"/>
        </xsl:call-template>
      </xsl:variable>

      <xsl:variable name="vAvWords"
       select="(count(ext:node-set($vrtfWords)/word) - 1)
             div
               count(key('kByCount', @count))"/>

      <!-- Output line: number of distinct messages, group size, average words -->
      <xsl:value-of select="concat(count(key('kByCount',
                                              @count
                                             )
                                         ),
                                   ' ',
                                   @count,
                                   ' ',
                                   $vAvWords,
                                   '&#xA;'
                                   )"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>


when applied to your source.xml:

<LOG>
  <SENT>
    <USER> 12345 </USER>
    <LOCATION> 55555 </LOCATION>
    <TARGET> 1 </TARGET>
    <TARGET_LOCATION> 23222 </TARGET_LOCATION>
    <MESSAGE> hello Fred </MESSAGE>
  </SENT>
  <SENT>
    <USER> 77777 </USER>
    <LOCATION> 76666 </LOCATION>
    <TARGET> 3 </TARGET>
    <TARGET_LOCATION> 34444 </TARGET_LOCATION>
    <MESSAGE> nice weather </MESSAGE>
  </SENT>
  <SENT>
    <USER> 77777 </USER>
    <LOCATION> 76666 </LOCATION>
    <TARGET> 4 </TARGET>
    <TARGET_LOCATION> 67777 </TARGET_LOCATION>
    <MESSAGE> nice weather </MESSAGE>
  </SENT>
  <SENT>
    <USER> 33333 </USER>
    <LOCATION> 12666 </LOCATION>
    <TARGET> 8 </TARGET>
    <TARGET_LOCATION> 98765 </TARGET_LOCATION>
    <MESSAGE> whats the latest news? </MESSAGE>
  </SENT>
  <SENT>
    <USER> 33333 </USER>
    <LOCATION> 12666 </LOCATION>
    <TARGET> 9 </TARGET>
    <TARGET_LOCATION> 46578 </TARGET_LOCATION>
    <MESSAGE> whats the latest news? </MESSAGE>
  </SENT>
</LOG>


produces the wanted result:

1 1 2
2 2 3
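
If FXSL is not at hand, the per-message word count can also be approximated
in plain XSLT 1.0. The following is only a sketch and not part of the solution
above: the "count-words" template is a hypothetical helper (it is not an FXSL
template), and it assumes that words are separated by whitespace only, so
"news?" counts as one word.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical helper (not from FXSL): counts the
       whitespace-separated words in $pStr -->
  <xsl:template name="count-words">
    <xsl:param name="pStr"/>
    <!-- Trim both ends and collapse each run of whitespace to one space -->
    <xsl:variable name="vNorm" select="normalize-space($pStr)"/>
    <xsl:choose>
      <xsl:when test="$vNorm = ''">0</xsl:when>
      <xsl:otherwise>
        <!-- words = number of remaining single spaces + 1 -->
        <xsl:value-of
          select="string-length($vNorm)
                  - string-length(translate($vNorm, ' ', ''))
                  + 1"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Print one line per MESSAGE: its word count, then its text -->
  <xsl:template match="/">
    <xsl:for-each select="/*/*/MESSAGE">
      <xsl:call-template name="count-words">
        <xsl:with-param name="pStr" select="."/>
      </xsl:call-template>
      <xsl:value-of select="concat(' ', normalize-space(.), '&#xA;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the same source.xml, this should report 2 words for "hello Fred"
and "nice weather" and 4 words for "whats the latest news?", which agrees with
the averages of 2 and 3 shown above.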


Hope this helped.


=====
Cheers,

Dimitre Novatchev.
http://fxsl.sourceforge.net/ -- the home of FXSL


"marina" <marina777uk@xxxxxxxxx> wrote in message
news:20030719075801.60127.qmail@xxxxxxxxxxxxxxxxxxxxxxxxxx
> Hi,
>
> I have an XML document that contains messages sent by
> people to one another. Many of these messages in the
> <MESSAGE> tags are repeated as they are sent by one
> person to many others.
>
> XML Snippet:
> --------------------------------------------------
> <LOG>
>    <SENT>
>       <USER> 12345 </USER>
>       <LOCATION> 55555 </LOCATION>
>       <TARGET> 1 </TARGET>
>       <TARGET_LOCATION> 23222 </TARGET_LOCATION>
>       <MESSAGE> hello Fred </MESSAGE>
>    </SENT>
>    <SENT>
>       <USER> 77777 </USER>
>       <LOCATION> 76666 </LOCATION>
>       <TARGET> 3 </TARGET>
>       <TARGET_LOCATION> 34444 </TARGET_LOCATION>
>       <MESSAGE> nice weather </MESSAGE>
>    </SENT>
>    <SENT>
>       <USER> 77777 </USER>
>       <LOCATION> 76666 </LOCATION>
>       <TARGET> 4 </TARGET>
>       <TARGET_LOCATION> 67777 </TARGET_LOCATION>
>       <MESSAGE> nice weather </MESSAGE>
>    </SENT>
>    <SENT>
>       <USER> 33333 </USER>
>       <LOCATION> 12666 </LOCATION>
>       <TARGET> 8 </TARGET>
>       <TARGET_LOCATION> 98765 </TARGET_LOCATION>
>       <MESSAGE> whats the latest news? </MESSAGE>
>    </SENT>
>    <SENT>
>       <USER> 33333 </USER>
>       <LOCATION> 12666 </LOCATION>
>       <TARGET> 9 </TARGET>
>       <TARGET_LOCATION> 46578 </TARGET_LOCATION>
>       <MESSAGE> whats the latest news? </MESSAGE>
>    </SENT>
> </LOG>
> --------------------------------------------------
> What I need to do is:-
>
> 1) Find out how many messages over all were sent to 1,
> 2, 3 etc people.
>
> As a duplicated message will always follow the
> original, i.e. be the next <MESSAGE> tag of the
> following sibling node, I'm thinking that the
> stylesheet would start with the first message and keep
> comparing siblings until it found one that was
> different. Then it would just add the previous number
> of sibling nodes? ( I probably need to use keys?)
>
> 2) For each of the total messages per group size,
> calculate the average number of words. No idea on this
> one I'm afraid!
>
> So the desired output from the snippet above would be:
>
> Group Size   Number of Messages   Av Number Words
>     1                1                   2
>     2                2                   3
>  (up to say 20)
>
> Many thanks in advance for any help,
>
> Marina

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

