
Re: how to extract words from a text

Subject: Re: how to extract words from a text
From: JBryant@xxxxxxxxx
Date: Fri, 10 Dec 2004 14:51:32 -0600
I decided to take a whack at it and came up with the following XSL file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
<xsl:output
  method="text"
  omit-xml-declaration="yes"
  indent="no"
/>

  <xsl:template match="text">
    <xsl:call-template name="makeList">
      <xsl:with-param name="textIn" select="translate(., ',', '')"/>
      <xsl:with-param name="wordsSoFar"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="makeList">
    <xsl:param name="textIn"/>
    <xsl:param name="wordsSoFar"/>
    <xsl:choose>
      <xsl:when test="contains($textIn, ' ')">
        <xsl:variable name="firstWord" select="substring-before($textIn, ' ')"/>
        <xsl:choose>
          <xsl:when test="string-length($firstWord)>2 and not(contains($wordsSoFar, $firstWord))">
            <xsl:variable name="newString">
              <xsl:choose>
                <xsl:when test="string-length($wordsSoFar)=0">
                  <xsl:value-of select="$firstWord"/>
                </xsl:when>
                <xsl:otherwise>
                  <xsl:value-of select="$firstWord"/><xsl:text>, </xsl:text><xsl:value-of select="$wordsSoFar"/>
                </xsl:otherwise>
              </xsl:choose>
            </xsl:variable>
            <xsl:call-template name="makeList">
              <xsl:with-param name="textIn" select="substring-after($textIn, ' ')"/>
              <xsl:with-param name="wordsSoFar" select="$newString"/>
            </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="makeList">
              <xsl:with-param name="textIn" select="substring-after($textIn, ' ')"/>
              <xsl:with-param name="wordsSoFar" select="$wordsSoFar"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:when>
      <xsl:otherwise>
        <xsl:choose>
          <xsl:when test="string-length($textIn)>2">
            <xsl:choose>
              <xsl:when test="contains($wordsSoFar, $textIn)">
                <xsl:value-of select="$wordsSoFar"/>
              </xsl:when>
              <xsl:otherwise>
                <xsl:value-of select="$textIn"/><xsl:text>, </xsl:text><xsl:value-of select="$wordsSoFar"/>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="$wordsSoFar"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>

When run against the following XML file:

<root>
  <text>This is a text, that is a text</text>
</root>

it produces the following output:

that, text, This

Note that it does not handle case, so 'Text' and 'text' are different
words. I only have so much time to fiddle, so I didn't get that far. Also,
I expect that other, more-experienced, folks around here can produce a
better implementation. Still, this one works.
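
For anyone who wants to take it further, here is a rough sketch of one way
case could be handled in XSLT 1.0. It folds the text to lower case with
translate() before splitting, and it wraps both the list and the candidate
word in ', ' delimiters before calling contains(), because contains() matches
substrings (a plain test would treat 'tex' as already present once 'text' is
in the list). The names upper, lower, word, rest, seen, and newList are
illustrative assumptions, and the case folding only covers ASCII letters.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- drop whitespace-only text nodes from the source document -->
  <xsl:strip-space elements="*"/>

  <!-- assumption: ASCII letters only; extend both strings for other alphabets -->
  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>

  <xsl:template match="text">
    <xsl:call-template name="makeList">
      <!-- fold to lower case, delete commas, collapse runs of whitespace -->
      <xsl:with-param name="textIn"
        select="normalize-space(translate(., concat($upper, ','), $lower))"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="makeList">
    <xsl:param name="textIn"/>
    <xsl:param name="wordsSoFar" select="''"/>
    <!-- the appended space means the last word is also followed by a space -->
    <xsl:variable name="word"
      select="substring-before(concat($textIn, ' '), ' ')"/>
    <xsl:variable name="rest" select="substring-after($textIn, ' ')"/>
    <!-- whole-word test: compare delimited word against delimited list -->
    <xsl:variable name="seen"
      select="contains(concat(', ', $wordsSoFar, ', '), concat(', ', $word, ', '))"/>
    <xsl:variable name="newList">
      <xsl:choose>
        <xsl:when test="string-length($word) &gt; 2 and not($seen)">
          <!-- new words go on the front of the list, as in the original -->
          <xsl:value-of select="$word"/>
          <xsl:if test="string-length($wordsSoFar) &gt; 0">, </xsl:if>
          <xsl:value-of select="$wordsSoFar"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$wordsSoFar"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="string-length($rest) &gt; 0">
        <xsl:call-template name="makeList">
          <xsl:with-param name="textIn" select="$rest"/>
          <xsl:with-param name="wordsSoFar" select="string($newList)"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$newList"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Run against the same XML file, this sketch should give:

that, text, this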

Jay Bryant
Bryant Communication Services




From: JBryant@xxxxxxxxx
Date: 12/10/2004 01:32 PM
Reply-To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: how to extract words from a text

> And look at substring-after() or substring-before() and a recursive
> template...

Bingo. If I were going to try this, I would write a recursive template
that nibbled the first word off the string, checked its length, kept it if
3+ characters or tossed it if too short, and then passed the remaining
string to the next instance of the template. Once no spaces remain in the
string, it's done.

Jay Bryant
Bryant Communication Services




From: António Mota <xptm@xxxxxxx>
Date: 12/10/2004 01:05 PM
Reply-To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: how to extract words from a text

I have no idea either, especially on a Friday at this hour...

But maybe this gives _you_ something to think about. It's a "word count"
method.

<xsl:variable name="txt"><xsl:value-of select="text" /></xsl:variable>
<xsl:variable name="x" select="normalize-space($txt)" />
<!-- strip spaces from the normalized string, so stray whitespace in $txt is not miscounted -->
<xsl:variable name="y" select="translate($x, ' ', '')" />
<xsl:variable name="wc" select="string-length($x) - string-length($y) + 1" />

so wc (word count) in your example will be 8...
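
As a rough sketch, the same idea can be wrapped in a complete stylesheet like
the one below. The match pattern /root is an assumption based on the sample
document, and both lengths are measured on the normalized string so that
leading, trailing, or repeated whitespace does not inflate the count. Note
that an empty text element would still report 1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/root">
    <!-- collapse runs of whitespace and trim both ends -->
    <xsl:variable name="x" select="normalize-space(text)"/>
    <!-- the same string with every space removed -->
    <xsl:variable name="y" select="translate($x, ' ', '')"/>
    <!-- n words are separated by n - 1 single spaces -->
    <xsl:value-of select="string-length($x) - string-length($y) + 1"/>
  </xsl:template>
</xsl:stylesheet>
```

For "This is a text, that is a text" the normalized string has 30 characters,
23 of them non-space, so the result is 30 - 23 + 1 = 8.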

And look at substring-after() or substring-before() and a recursive
template...


Quoting Jan Limpens <jan.limpens@xxxxxxxxx>:

> hello again,
>
> I hope you can help me with this one just as well, as with my other
> question today! :)
>
> I have an XML document
> <root>
> <text>This is a text, that is a text</text>
> </root>
>
> and I need to extract every word from it - once, ignoring case, and
> ordered by occurrence, stripping 1-2 letter words - to make a meta
> keywords tag from it...
>
> <meta name="keywords" content="text, that, this"/>
>
> the horror! the horror! I have no idea how to do this! :)
>
> thanks again!
> --
> Jan
> http://www.limpens.com
>
> Otakoo Saloon Cartoon - newest episode at http://limpens.com/oscredirect
>
>




