
Re: marking up text when term from other file is found

Subject: Re: marking up text when term from other file is found
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Thu, 22 Apr 2010 13:54:42 +0200
Two comments and two questions.

C1:  The pattern containing all terms can be constructed once and not
repeatedly within the template doing the analyze-string.
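
A minimal sketch of C1, assuming the indexTerms.xml layout from the
original post. The variable name term-pattern is made up for this
example, and the sketch assumes that there is at least one term and
that no term contains regex metacharacters:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:variable name="index-terms" select="document('indexTerms.xml')"/>

  <!-- Global variable: evaluated once for the whole transformation,
       not once per text node. The name is hypothetical. -->
  <xsl:variable name="term-pattern" as="xs:string"
      select="string-join(for $t in $index-terms/terms/term
                          return concat('(', normalize-space($t), ')'),
                          '|')"/>

  <!-- Identity template, as in the quoted stylesheet -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" priority="10">
    <!-- The regex attribute is an attribute value template, so the
         precomputed string can simply be referenced here -->
    <xsl:analyze-string select="." regex="{$term-pattern}">
      <xsl:matching-substring>
        <ph><xsl:value-of select="."/></ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

The id attribute is left out here to keep the sketch short.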

C2:  The flags attribute of analyze-string should be used to do a
case-insensitive match: flags='i'
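
A sketch of C2 under the same assumptions (sample indexTerms.xml, no
regex metacharacters in the terms). Once the match ignores case, the
lookup of the term for the id has to ignore case as well, so both
sides are lower-cased. The variable names are made up, and the id is
built from index1 and index2 only, which is all the sample uses:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:variable name="index-terms" select="document('indexTerms.xml')"/>

  <xsl:variable name="term-pattern" as="xs:string"
      select="string-join(for $t in $index-terms/terms/term
                          return concat('(', normalize-space($t), ')'),
                          '|')"/>

  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" priority="10">
    <!-- flags="i" makes "Sick children" match "sick children" too -->
    <xsl:analyze-string select="." regex="{$term-pattern}" flags="i">
      <xsl:matching-substring>
        <!-- Compare in lower case; ". = regex-group(0)" against the
             term text would fail when the capitalization differs -->
        <xsl:variable name="hit" select="lower-case(.)"/>
        <xsl:variable name="term"
            select="$index-terms/terms/term
                      [lower-case(normalize-space(.)) = $hit][1]"/>
        <ph id="{string-join(($term/@index1, $term/@index2), '_')}">
          <xsl:value-of select="."/>
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```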


Q1:  XSLT 2.0 regular expressions don't have the zero-length assertion
\b available to match a word boundary, so a term can also match inside
a longer word (e.g. "Nightmare" inside "nightmares"). With
analyze-string it is not possible to apply the usual trick of adding
an extra character before and after the string. So how can an exact
match be done here?
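
One possible workaround for Q1, offered as a sketch rather than a
definitive answer: let the regex absorb any word characters that touch
the term, and mark the match only if nothing was absorbed, i.e. if the
whole match equals the captured term. The names sorted-terms and
bounded-pattern are made up, and the terms are again assumed to be
free of regex metacharacters:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:variable name="index-terms" select="document('indexTerms.xml')"/>

  <!-- Longest terms first, so that "Children, illness" is tried
       before "Children" -->
  <xsl:variable name="sorted-terms" as="xs:string*">
    <xsl:for-each select="$index-terms/terms/term">
      <xsl:sort select="string-length(normalize-space(.))"
                data-type="number" order="descending"/>
      <xsl:sequence select="normalize-space(.)"/>
    </xsl:for-each>
  </xsl:variable>

  <!-- Reluctant \w*? before and greedy \w* after the term group
       pick up word characters adjacent to the term -->
  <xsl:variable name="bounded-pattern" as="xs:string"
      select="concat('\w*?(', string-join($sorted-terms, '|'), ')\w*')"/>

  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" priority="10">
    <xsl:analyze-string select="." regex="{$bounded-pattern}" flags="i">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- Whole match equals group 1: no word characters were
               absorbed, so the term stands on its own -->
          <xsl:when test=". eq regex-group(1)">
            <ph><xsl:value-of select="."/></ph>
          </xsl:when>
          <!-- Term was only part of a longer word: copy unchanged -->
          <xsl:otherwise>
            <xsl:value-of select="."/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

A known limitation: a rejected match still consumes its text, so a
shorter term that overlaps the rejected span is not found. For example
"Children, illnesses" is copied unchanged even though "Children" on
its own would qualify.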

Q2: If the index or document is big, it might be faster to have
xsl:key on the indexTerms. Is it possible to construct such a key with
the matching string being the original <term/> content *in lowercase*?
Can it be done by constructing a temporary tree and applying xsl:key
to that?
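
On Q2, a sketch assuming XSLT 2.0: the use attribute of xsl:key is an
ordinary expression, so it can lower-case the term content directly,
and the three-argument form of key() can search the index document
from inside matching-substring, where the context item is a string
rather than a node. The key name term-by-lc is made up, and terms are
assumed to contain no regex metacharacters:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:variable name="index-terms" select="document('indexTerms.xml')"/>

  <!-- Key value is the term text, whitespace-normalized and in
       lower case -->
  <xsl:key name="term-by-lc" match="term"
           use="lower-case(normalize-space(.))"/>

  <xsl:variable name="term-pattern" as="xs:string"
      select="string-join(for $t in $index-terms/terms/term
                          return concat('(', normalize-space($t), ')'),
                          '|')"/>

  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" priority="10">
    <xsl:analyze-string select="." regex="{$term-pattern}" flags="i">
      <xsl:matching-substring>
        <!-- Third argument: search the index document, because there
             is no context node inside matching-substring -->
        <xsl:variable name="term"
            select="key('term-by-lc', lower-case(normalize-space(.)),
                        $index-terms)[1]"/>
        <ph id="{string-join(($term/@index1, $term/@index2), '_')}">
          <xsl:value-of select="."/>
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this form no temporary tree is needed. Whether the key is
actually faster than the predicate for about 150 terms would have to
be measured.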

-W


On Thu, Apr 22, 2010 at 8:21 AM, Mukul Gandhi <gandhi.mukul@xxxxxxxxx> wrote:
>
> I would try to solve this as, following:
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                       version="2.0">
>
>  <xsl:output method="xml" indent="yes" />
>
>  <xsl:variable name="index-terms" select="document('indexTerms.xml')" />
>
>  <xsl:template match="node() | @*">
>    <xsl:copy>
>          <xsl:apply-templates select="node() | @*" />
>        </xsl:copy>
>  </xsl:template>
>
>  <xsl:template match="text()" priority="10">
>         <xsl:analyze-string select="."
>                             regex="{string-join(for $term in
> $index-terms/terms/term return concat('(', $term, ')'), '|')}">
>            <xsl:matching-substring>
>                 <xsl:variable name="idVal"
>                     select="string-join(for $attrVal in
>                         $index-terms/terms/term[. = regex-group(0)]/@*[starts-with(name(),'index')]
>                         return $attrVal, '_')" />
>                 <ph id="{$idVal}">
>                     <xsl:value-of select="." />
>                 </ph>
>           </xsl:matching-substring>
>           <xsl:non-matching-substring>
>               <xsl:value-of select="." />
>           </xsl:non-matching-substring>
>         </xsl:analyze-string>
>  </xsl:template>
>
> </xsl:stylesheet>
>
> You may adapt this, to suit your requirements if needed.
>
> On Thu, Apr 22, 2010 at 8:38 AM, Hoskins & Gretton
> <hoskgret@xxxxxxxxxxxxxxxx> wrote:
> >
> > Hi, I need help finding resources (examples and/or XSL) for this
> > situation, for which I haven't found quite the right recipe in my
> > searches of the list archives.
> > Given an XML file containing a list of terms and another file
> > containing a mix of elements containing text (narrative content, some
> > inline markup for emphasis and footnotes), I was asked if I could find
> > occurrences of each term wherever it appeared in the narrative content,
> > and wrap each occurrence with a tag. So my first thought is to load up
> > each document into a variable. But then I don't know what the most
> > effective method of string comparison would be, given that the
> > narrative document might have the term's words with different
> > capitalization. If anyone can point me in the right direction, I'd
> > appreciate it. Also I would like to know if there is a practical limit
> > to how large a narrative file I can run with about 150 terms to find in
> > the text. And if a different approach would work better, such as
> > writing Java to do brute force search and replace, please tell me so.
> > (I work with a Java programmer. Everything looks like a Java problem to
> > her and an XSL problem to me.)
> > -- Dorothy
> > Note: Using Saxon B 9.1.0.7. I just made up a set of terms and a bad
> > sentence as an example.
> > Example of terms (indexTerms.xml):
> > <?xml version="1.0" encoding="UTF-8"?>
> > <terms>
> >   <term index1="anxiety">Anxiety</term>
> >   <term index1="children">Children</term>
> >   <term index1="children" index2="illness">Children, illness</term>
> >   <term index1="children" index2="nightmare">Children, nightmare</term>
> >   <term index1="cure" index2="fever">Cure fever</term>
> >   <term index1="cure" index2="illness">Cure illness</term>
> >   <term index1="anxiety" index2="nightmare">Nightmare</term>
> >   <term index1="children" index2="illness">Sick children</term>
> >   <term index1="anxiety" index2="phobia">Worries, phobias and anxiety</term>
> > </terms>
> >
> > Example of narrative (sampleTopic.xml):
> > <?xml version='1.0' encoding='UTF-8'?>
> > <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> > "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
> > <topic id="sampleTopic">
> >  <title>sampleTopic</title>
> >  <body>
> >    <p>markup for sample terms testing a set of phrases to match to the
> > content of index terms:</p>
> >    <p>Texttexttext text some of the terms are already in &lt;ph&gt; i.e.
> > <ph id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
> > children</ph> and sometime the same terms occur, <i>but different case</i>,
> > not in a ph: Curing fever and <b>Sick children</b>. I need to get all the
> > occurrences of each of the term element strings marked up with &lt;ph&gt;
> > </p>
> >  </body>
> > </topic>
> >
> > Desired result:
> > <?xml version='1.0' encoding='UTF-8'?>
> > <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> > "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
> > <topic id="sampleTopic">
> >  <title>sampleTopic</title>
> >  <body>
> >    <p>markup for sample terms testing a set of phrases to match to the
> > content of index terms:</p>
> >    <p>Texttexttext text some of the terms are already in &lt;ph&gt; i.e.
> > <ph id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
> > children</ph> and sometime the same terms occur, <i>but different case</i>,
> > not in a ph: <ph id="cure_fever">Curing fever</ph> and <b><ph
> > id="children_illness">Sick children</ph></b>. I need to get all the
> > occurrences of each of the term element strings marked up with &lt;ph&gt;
> > </p>
> >  </body>
> > </topic>
> >
> > XSL:
> > <?xml version="1.0" encoding="UTF-8"?>
> > <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> > version="2.0">
> > <xsl:param name="indexFile">indexTerms.xml</xsl:param>
> > <xsl:param name="textFile">sampleTopic.xml</xsl:param>
> > <xsl:variable name="termsDocument"
> > select="document($indexFile)"></xsl:variable>
> > <xsl:variable name="textDocument"
> > select="document($textFile)"></xsl:variable>
> > <xsl:template match="*" name="test1"><xsl:result-document
> > href="matchText-test.xml" method="xml">
> > <!-- proof that I can get the terms -->
> > <xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>first term is
> > </xsl:text><xsl:value-of
> > select="$termsDocument/terms/term[1]"/></xsl:comment>
> > <xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>second term is
> > </xsl:text><xsl:value-of
> > select="$termsDocument/terms/term[2]"/></xsl:comment>
> > <xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>third term is
> > </xsl:text><xsl:value-of
> > select="$termsDocument/terms/term[3]"/></xsl:comment>
> > <!-- now how do I find them in the $textDocument elements and mark them up? -->
> > </xsl:result-document>
> > </xsl:template>
> > </xsl:stylesheet>
>
>
>
> --
> Regards,
> Mukul Gandhi
