Re: Collect word count with xslt2.0 on saxon 8
Subject: Re: Collect word count with xslt2.0 on saxon 8
From: George Cristian Bina <george@xxxxxxxxxxxxx>
Date: Tue, 16 May 2006 10:04:22 +0300
Hello Karen,
You can get the word count more easily than that. First collect, in a
variable, the text that belongs to an element marked topic/topic but not
to any topic/topic element nested inside it, and then just count the
words in that variable.
To collect the text, once we match a topic/topic element we apply
templates to its content in a new mode (getText). In that mode an empty
template matches topic/topic elements, so their text content is excluded;
everything else falls through to the built-in templates, which copy the
text.
The following stylesheet shows this:
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <counts>
      <xsl:apply-templates/>
    </counts>
  </xsl:template>

  <!-- Default mode: suppress text so only the records are output. -->
  <xsl:template match="text()"/>

  <xsl:template match="*[contains(@class, 'topic/topic')]">
    <!-- Text of this element, excluding nested topic/topic elements. -->
    <xsl:variable name="text">
      <xsl:apply-templates mode="getText" select="node()"/>
    </xsl:variable>
    <record>
      <text>
        <xsl:value-of select="$text"/>
      </text>
      <count>
        <xsl:value-of
            select="count(tokenize(lower-case($text),'(\s|[,.!:;]|[n][b][s][p][;])+')[string(.)])"/>
      </count>
    </record>
    <!-- Continue in the default mode to reach the nested topic/topic elements. -->
    <xsl:apply-templates/>
  </xsl:template>

  <!-- getText mode: do nothing for nested topic/topic elements. -->
  <xsl:template match="*[contains(@class, 'topic/topic')]" mode="getText"/>
</xsl:stylesheet>
On your sample input it gives:
<?xml version="1.0" encoding="UTF-8"?>
<counts>
<record>
<text>
communications and information theory
top element
elements can be nested Generalized Markup
Language defined by ISO 8879.
</text>
<count>17</count>
</record>
<record>
<text>
communications and information theory
top element
elements can be nested (for a number of
technical reasons beyond the scope of this article).
</text>
<count>22</count>
</record>
<record>
<text>
communications and information theory
top element
elements can be nested maintain repositories
of structured documentation for more than a decade, but it is
not well
</text>
<count>25</count>
</record>
<record>
<text> But
the metrics for XML on the Web communications and
information theory
top element
elements can be nested measures, or are a
little polluted by voodoo ideology about good </text>
<count>29</count>
</record>
</counts>
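Since you mention that other class values need the same treatment, here is a rough sketch of one way the idea could be parameterized. It is only an illustration, not a tested replacement for the stylesheet above: the namespace URI urn:example:word-count and the function names f:own-text and f:word-count are made up for the example. Instead of a separate mode, it selects the descendant text nodes whose nearest ancestor carrying the class token is the element itself.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:word-count"
    exclude-result-prefixes="xs f">

  <xsl:output indent="yes"/>

  <!-- Hypothetical helper: the text inside $e that is not inside a nested
       element whose class attribute contains $token. Assumes $e itself
       carries $token in its class attribute. Text nodes are joined with
       no separator, as in the temporary-tree approach. -->
  <xsl:function name="f:own-text" as="xs:string">
    <xsl:param name="e" as="element()"/>
    <xsl:param name="token" as="xs:string"/>
    <xsl:sequence
        select="string-join(
                  $e//text()[ancestor::*[contains(@class, $token)][1] is $e],
                  '')"/>
  </xsl:function>

  <!-- Hypothetical helper: same tokenizing expression as in the thread;
       the [string(.)] predicate drops empty tokens. -->
  <xsl:function name="f:word-count" as="xs:integer">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence
        select="count(tokenize(lower-case($s),
                  '(\s|[,.!:;]|[n][b][s][p][;])+')[string(.)])"/>
  </xsl:function>

  <xsl:template match="/">
    <counts>
      <xsl:apply-templates/>
    </counts>
  </xsl:template>

  <!-- Default mode: suppress text so only the records are output. -->
  <xsl:template match="text()"/>

  <xsl:template match="*[contains(@class, 'topic/topic')]">
    <record>
      <count>
        <xsl:value-of select="f:word-count(f:own-text(., 'topic/topic'))"/>
      </count>
    </record>
    <!-- Recurse so nested topic/topic elements get their own record. -->
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
```

A template for another class value could call f:own-text with a different token. Note that contains() is a plain substring test, so 'topic/topic' would also match a longer value that merely starts with it; matching on ' topic/topic ' with the surrounding spaces, as in your class attributes, is stricter.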
Best Regards,
George
---------------------------------------------------------------------
George Cristian Bina
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
Karen McAdams wrote:
I have the following structure that I need to collect
word counts for, from each element that has a class
attribute that contains " topic/topic ",
without counting its child elements that also contain
the class attribute " topic/topic ":
<root>
  <topic class=" topic/topic foo/bar ">
    <p> communications and information theory</p>
    <title> top element</title>
    <relinfo> elements can be nested</relinfo>
    Generalized Markup Language defined by ISO 8879.
    <concept class=" topic/topic foo/bar ">
      <p> communications and information theory</p>
      <title> top element</title>
      <relinfo> elements can be nested</relinfo>
      (for a number of technical reasons beyond the scope of this article).
      <topic class=" topic/topic foo/bar ">
        <p> communications and information theory</p>
        <title> top element</title>
        <relinfo> elements can be nested</relinfo>
        maintain repositories of structured documentation for more than a decade, but it is not well
        <concept class=" topic/topic foo/bar ">
          But the metrics for XML on the Web
          <p> communications and information theory</p>
          <title> top element</title>
          <relinfo> elements can be nested</relinfo>
          measures, or are a little polluted by voodoo ideology about good
        </concept>
      </topic>
    </concept>
  </topic>
</root>
I have this template that gets the word count for each
element and its child elements, including the elements
that have class attributes that contain " topic/topic ":
<xsl:template match="*[contains(@class, 'topic/topic ')]">
  <xsl:variable name="level"
      select="count(ancestor::*[contains(@class, 'topic/topic ')]) + 1"/>
  <xsl:variable name="ct" select="if ($level = 1) then concat(title,' ') else ' '"/>
  <xsl:variable name="h1" select="if ($level = 2) then concat(title,' ') else ' '"/>
  <xsl:variable name="h2" select="if ($level = 3) then concat(title,' ') else ' '"/>
  <xsl:variable name="h3" select="if ($level = 4) then concat(title,' ') else ' '"/>
  <xsl:variable name="wc"
      select="count(tokenize(lower-case(.),'(\s|[,.!:;]|[n][b][s][p][;])+')[string(.)])"/>
  <xsl:apply-templates/>
</xsl:template>
I added another template that contains the count of
its child elements:
<xsl:template match="*[contains(@class, 'topic/topic ')]" mode="filterCount">
  <sum>
    <xsl:value-of
        select="count(tokenize(lower-case(.),'(\s|[,.!:;]|[n][b][s][p][;])+')[string(.)])"/>
  </sum>
</xsl:template>
That I store in a variable and then subtract from the
total within the first template above:
<xsl:variable name="childcounts">
  <sums>
    <xsl:apply-templates mode="filterCount"/>
  </sums>
</xsl:variable>
<xsl:variable name="total-child" select="sum($childcounts/sums/sum)"/>
<xsl:variable name="total-roman" select="sum($wc - $total-child)"/>
I would like to find a more elegant approach to this
because there are also other attributes in this
content that need to have the same technique applied
to them.
Would it be a better approach to copy the elements to
another document node and then perform the word count,
which would be applied recursively to all child
elements to arrive at the count? And what would this
template match look like?