Count a specific word in a document

Michael Kay mike at saxonica.com
Wed Jun 13 12:18:53 PDT 2007


>   for $elijah in doc("/db/mjs/ElijahLibretto.xhtml")/html
>   let $elijah-para := $elijah//td/p[i/text() = 'Elijah']
>   let $txt := string-join($elijah-para/text(), " ")

You haven't shown your source document, but the above seems surprising to
me. If the paragraph in question has

<p><i>Elijah</i> Rise then ye priests of Baal, select and slay a bullock,
and I then will call on the Lord Jehovah</p>

then this will work. But in general, using /text() on a paragraph with mixed
content is dangerous, because it drops any content that sits inside nested
markup. For example, it would fail with:

<p><i>Obadiah</i> <quote>If with all your hearts you truly seek me, ye shall
ever surely find me.</quote> Thus saith our God.</p>

I would normally expect to see

let $txt := string($elijah-para)

(except that this will probably be done implicitly anyway).
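
To make the difference concrete, here is a minimal sketch that uses the Obadiah paragraph above as an inline element. The variable name $p is only illustrative, and the boundary-space declaration is there just so the inline constructor keeps whitespace roughly the way a parsed document would.

```xquery
declare boundary-space preserve;

let $p :=
  <p><i>Obadiah</i> <quote>If with all your hearts you truly seek me, ye shall
ever surely find me.</quote> Thus saith our God.</p>
return (
  (: text node children only: the words inside <i> and <quote> are lost :)
  string-join($p/text(), " "),
  (: string value: all descendant text, in document order :)
  string($p)
)
```

Note that string() accepts at most one item, so if the predicate can select several paragraphs, something like string-join($elijah-para/string(), " ") is probably closer to what you want.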

In fact the use of /text() is very common in XQuery circles, and in my view
it's usually wrong. You nearly always want the string value of the element
rather than its text node children: /string() rather than /text(), though as
I said, the conversion to a string is usually implicit.


>   let $words := tokenize($txt,"(\s|[,.!:;]|[n][b][s][p][;])+")

A strange regular expression, this. Firstly, '[n][b][s][p][;]' can be written
'nbsp;'. But I wouldn't normally expect to see the literal text nbsp; in your
source. If there's an entity reference &nbsp; in your lexical XML, then the
text node will contain an xA0 (non-breaking space) character, and it is this
that you should match, by using '&#xa0;' in your regular expression. But a
better regex is \W+, which matches any run of "non-word" characters.
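
As a rough illustration, the sketch below tokenizes a made-up test string (not taken from your document) both ways. The string contains a real xA0 character, written as a character reference. In the schema regex dialect that XQuery uses, \s covers only space, tab, newline and carriage return, which is why the first pattern has to name xA0 explicitly, while \W+ picks it up as a separator character.

```xquery
(: hypothetical test string; &#xa0; becomes a single xA0 character :)
let $txt := "Lord,&#xa0;hear our cry! The Lord is near."
return (
  (: the original pattern, with the xA0 character in place of 'nbsp;' :)
  tokenize($txt, "(\s|[,.!:;]|&#xa0;)+"),
  (: the simpler alternative: split on any run of non-word characters :)
  tokenize($txt, "\W+")
)
```

For this input both calls should give the same words. Because the string ends with a separator, tokenize() also returns a zero-length string as the last token, which is harmless when you are only counting matches for a specific word.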
> 
> I can't figure out how to count the number of string tokens 
> that are 'Lord'. I can get them with:
> 
>   for $word in $words
>   return $word[$word = 'Lord']
> 
> but I can't seem to get the count of them.

count(tokenize($txt, '\W+')[.='Lord'])
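
Putting the pieces together, a self-contained sketch might look like the following. Since I haven't seen your document, it uses the two sample paragraphs from this message wrapped in a td element. The names $sample, $paras and $txt are only illustrative, and against your real data you would start from doc("/db/mjs/ElijahLibretto.xhtml")/html//td/p instead.

```xquery
(: stand-in for the real document, which was not shown :)
let $sample :=
  <td>
    <p><i>Elijah</i> Rise then ye priests of Baal, select and slay a bullock,
and I then will call on the Lord Jehovah</p>
    <p><i>Obadiah</i> <quote>If with all your hearts you truly seek me, ye shall
ever surely find me.</quote> Thus saith our God.</p>
  </td>
(: paragraphs whose speaker label is 'Elijah' :)
let $paras := $sample/p[i = 'Elijah']
(: string value of each paragraph, so text in nested markup is kept :)
let $txt := string-join($paras/string(), " ")
return count(tokenize($txt, "\W+")[. = 'Lord'])
```

On this sample the result should be 1. The comparison is case-sensitive. If you also want to count 'LORD' or 'lord', compare lower-case(.) with 'lord' instead.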

Michael Kay
http://www.saxonica.com/


