Re: tokenize() and regex-group ?

Subject: Re: tokenize() and regex-group ?
From: Matthieu Ricaud-Dussarget <matthieu.ricaud@xxxxxxxxx>
Date: Wed, 18 Jul 2012 11:54:03 +0200
Hi

Just a last word to say my problem is solved. Thanks for your quick and helpful replies!

Just a few comments here:

I used my own igs:tokenize-as-xml function, which doesn't lose the "regex separator" (see my last mail).
I just changed the output of the function to be a single element with children:
<xsl:function name="igs:tokenize-as-xml" as="element(igs:tok)">
instead of a sequence of elements:
<xsl:function name="igs:tokenize-as-xml" as="element()*">


Why? Because it seems one cannot use axes and node-order comparisons (preceding-sibling::, the << operator ...) "very well" within a sequence; one needs a context.
I actually got some strange results when using:
<xsl:variable name="textBegin" select="string-join($tokenTextAsXML[ . &lt;&lt; $myFocusElement],'')"/>
(it looks as if a filter is added that selects only the nodes whose name is the same as $myFocusElement)
By the way, myFocusElement is defined within the recursion by: <xsl:variable name="myFocusElement" select="$tokenTextAsXML[last() - $lookBacklevel + 1]" as="element()"/>


When I used <xsl:function name="igs:tokenize-as-xml" as="element(igs:tok)"> and
<xsl:variable name="textBegin" select="string-join($tokenTextAsXML/igs:*[ . &lt;&lt; $myFocusElement],'')"/>
everything went fine.
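
A likely explanation for the behaviour described above: elements returned from a function as a flat sequence (as="element()*") are parentless, so each one is the root of its own tree. They are not siblings of one another, and the relative document order of nodes in different trees is implementation-dependent, which makes << and >> unreliable across them. Wrapping the tokens in one parent element, as done here, puts them in a single tree with a defined order. A hedged alternative sketch, for the case where a flat sequence is kept, is to split by position instead of by node order. The ex: namespace, the ex:tokens function and the sample values are hypothetical, not part of the original code:

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/ns/sketch"
    exclude-result-prefixes="xs ex">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: returns a FLAT sequence of parentless elements,
       one per separator match and one per piece of text in between. -->
  <xsl:function name="ex:tokens" as="element()*">
    <xsl:param name="string" as="xs:string"/>
    <xsl:param name="regex" as="xs:string"/>
    <xsl:analyze-string select="$string" regex="{$regex}">
      <xsl:matching-substring>
        <ex:sep><xsl:value-of select="."/></ex:sep>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <ex:text><xsl:value-of select="."/></ex:text>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <!-- Entry point for the sketch (e.g. Saxon: -it:main). -->
  <xsl:template name="main">
    <xsl:variable name="tokens" select="ex:tokens('my foo bar', '\s+')" as="element()*"/>
    <xsl:variable name="lookBacklevel" select="3" as="xs:integer"/>
    <!-- Position of the first token that belongs to the second part. -->
    <xsl:variable name="splitPos" select="count($tokens) - $lookBacklevel + 1" as="xs:integer"/>
    <!-- The position in the sequence is always well defined, unlike the
         document order of nodes that live in different trees. -->
    <xsl:value-of select="concat('[', string-join($tokens[position() lt $splitPos], ''), ']')"/>
    <xsl:value-of select="concat('[', string-join($tokens[position() ge $splitPos], ''), ']')"/>
  </xsl:template>

</xsl:stylesheet>
```

Because the separators are kept as tokens, joining the two parts with an empty string gives back the original text.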


Well, I tried to simplify the explanation; I hope it is understandable.
See the real code at the bottom of this mail.

Best Regards,
Matthieu Ricaud.

<xsl:function name="igs:tokenize-as-xml" as="element(igs:tok)">
<xsl:param name="string" as="xs:string"/>
<xsl:param name="regex" as="xs:string"/>
<xsl:variable name="tmp" as="element()*">
<xsl:analyze-string select="$string" regex="{$regex}">
<xsl:matching-substring>
<igs:sep><xsl:value-of select="."/></igs:sep>
</xsl:matching-substring>
<xsl:non-matching-substring>
<igs:text><xsl:value-of select="."/></igs:text>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:variable>
<igs:tok>
<xsl:for-each-group select="$tmp" group-adjacent="local-name(.)='sep'">
<xsl:choose>
<xsl:when test="current-grouping-key()">
<igs:sep><xsl:value-of select="string-join(current-group(),'')"/></igs:sep>
</xsl:when>
<xsl:otherwise>
<xsl:copy-of select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</igs:tok>
</xsl:function>


(myFocusElement is called spliter in the real code)
<xsl:template name="addRefTheme">
<xsl:param name="text" as="xs:string"/>
<xsl:param name="lookBacklevel" select="1" as="xs:integer"/>
<xsl:variable name="tokenTextAsXML" select="igs:tokenize-as-xml($text,'(\s|\(|«\p{Z}|\p{Z}»|[lL]’)')" as="element()*"/>
<!-- the text is split into 2 parts; we then try to find a matching anchor for the 2nd one -->
<xsl:variable name="tokenNum" select="count($tokenTextAsXML/igs:*)" as="xs:integer"/>
<xsl:variable name="spliter" select="$tokenTextAsXML/igs:*[last() - $lookBacklevel + 1]" as="element()"/>
<xsl:variable name="textBegin" select="string-join($tokenTextAsXML/igs:*[ . &lt;&lt; $spliter],'')"/>
<xsl:variable name="textEnd" select="string-join($tokenTextAsXML/igs:*[. &gt;&gt; $spliter or . is $spliter],'')"/>
<xsl:variable name="matchingAncres" select="$ancres[normalize-space($textEnd)!=''][igs:match-ancre(.,$textEnd)]" as="element()*"/>
<xsl:variable name="error.msg">
[ERROR][STEP7][ref:theme] <xsl:value-of select="count($matchingAncres)"/> ancre(s) trouvee(s) pour [text=<xsl:value-of select="concat($text,$asterix)"/>]<xsl:call-template name="lf"/>
<xsl:if test="$config/@debug='1'">
[lookBacklevel=<xsl:value-of select="$lookBacklevel"/>]<xsl:call-template name="lf"/>
[textBegin=<xsl:value-of select="$textBegin"/>]<xsl:call-template name="lf"/>
[textEnd=<xsl:value-of select="$textEnd"/>]<xsl:call-template name="lf"/>
</xsl:if>
</xsl:variable>
<xsl:variable name="ref_theme_override" select="$config/igs:ref_theme_override/igs:string[normalize-space(@value)=concat(normalize-space($textEnd),$asterix)]" as="element()?"/>
<xsl:choose>
<xsl:when test="count($ref_theme_override)=1">
<xsl:copy-of select="$ref_theme_override/node()" copy-namespaces="no"/>
</xsl:when>
<xsl:when test="count($matchingAncres)=1">
<xsl:value-of select="$textBegin"/>
<ref:theme idrefCorps="{$matchingAncres/@id}"><xsl:value-of select="concat($textEnd,$asterix)"/></ref:theme>
</xsl:when>
<xsl:when test="count($matchingAncres) gt 1">
<xsl:message><xsl:value-of select="$error.msg"/></xsl:message>
<xsl:value-of select="concat($text,$asterix)"/>
</xsl:when>
<xsl:when test="$lookBacklevel lt $tokenNum">
<xsl:call-template name="addRefTheme">
<xsl:with-param name="text" select="$text"/>
<xsl:with-param name="lookBacklevel" select="$lookBacklevel + 1"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:message><xsl:value-of select="$error.msg"/></xsl:message>
<xsl:value-of select="concat($text,$asterix)"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>




On 17/07/2012 16:54, Matthieu Ricaud-Dussarget wrote:
Thank you Michael.

At first I used <xsl:analyze-string>, before I realized I had to check for one *or more* words to match the definition.
I thought using tokenize would help with going back recursively from word to word in the string (with the help of position() predicates).
But I did not think about the problems with different separator patterns (the igs:match-ancre() function is permissive about this, but the output tagging is not good, e.g. <link idref="#foobar">« foo bar*</link> » should rather be « <link idref="#foobar">foo bar*</link> »).


As usual you're right :-) I have to go back to <xsl:analyze-string>.

The problem I suspected was about a regex which matches 1 or 2 or N words, something like $wordRegex$sepRegex{{$lookBacklevel}}
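
One way such a level-dependent regex could be built is sketched below, under simplifying assumptions: a word is any run of non-whitespace (\S+), a separator is any run of whitespace (\s+), and the text has no trailing whitespace. The variable name tailRegex and the sample values are hypothetical. Building the pattern with concat() in a variable avoids having to escape the quantifier braces inside the regex attribute, which is an attribute value template:

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Entry point for the sketch (e.g. Saxon: -it:main). -->
  <xsl:template name="main">
    <xsl:param name="text" select="'my foo bar'" as="xs:string"/>
    <xsl:param name="lookBacklevel" select="2" as="xs:integer"/>

    <!-- Matches the last $lookBacklevel words together with the separators
         between them: ($lookBacklevel - 1) "word + separator" pairs, then a
         final word anchored at the end of the string. -->
    <xsl:variable name="tailRegex" as="xs:string"
        select="concat('(\S+\s+){', $lookBacklevel - 1, '}\S+$')"/>

    <xsl:analyze-string select="$text" regex="{$tailRegex}">
      <!-- The matched tail is the candidate to look up as a definition. -->
      <xsl:matching-substring>
        <xsl:value-of select="concat('[textEnd=', ., ']')"/>
      </xsl:matching-substring>
      <!-- Everything before it is kept as is, separators included. -->
      <xsl:non-matching-substring>
        <xsl:value-of select="concat('[textBegin=', ., ']')"/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this approach the original separators are never discarded, so the string can be reconstructed by concatenating the two parts.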

After your emphasis on fn:tokenize(), I finally started on another way of doing it, with the help of a function that tokenizes as XML:

<xsl:function name="igs:tokenize-as-xml" as="element()*">
<xsl:param name="string" as="xs:string"/>
<xsl:param name="regex" as="xs:string"/>
<xsl:variable name="tmp" as="element()*">
<xsl:analyze-string select="$string" regex="{$regex}">
<xsl:matching-substring>
<igs:sep><xsl:value-of select="."/></igs:sep>
</xsl:matching-substring>
<xsl:non-matching-substring>
<igs:text><xsl:value-of select="."/></igs:text>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:variable>
<xsl:for-each-group select="$tmp" group-adjacent="local-name(.)='sep'">
<xsl:choose>
<xsl:when test="current-grouping-key()">
<igs:sep><xsl:value-of select="string-join(current-group(),'')"/></igs:sep>
</xsl:when>
<xsl:otherwise>
<xsl:copy-of select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:function>


I hope I can tie everything together; I'll let you know when I finish.

Regards,
Matthieu.

On 17/07/2012 15:22, Michael Kay wrote:
You need to use xsl:analyze-string. I don't understand the difficulties in using this inside a recursive template. xsl:analyze-string can do everything that tokenize can do; you could implement tokenize as

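<!-- Illustration only: the fn prefix is bound to a reserved namespace,
     so a real stylesheet must declare this function under its own prefix. -->
<xsl:function name="fn:tokenize" as="xs:string*">
  <xsl:param name="in" as="xs:string"/>
  <xsl:param name="regex" as="xs:string"/>
  <xsl:analyze-string select="$in" regex="{$regex}">
    <xsl:matching-substring/>
    <xsl:non-matching-substring>
       <xsl:sequence select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:function>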

Start by replacing your call to tokenize with a call to that function, then add whatever functionality you need.

Michael Kay
Saxonica

On 17/07/2012 14:02, Matthieu Ricaud-Dussarget wrote:
Hi all,

I'm tokenizing some text within a recursive template. The goal is to generate links to some "definitions" inside the doc.
Let's say my text is: "my foo bar"
=> the 1st level of recursion searches for "bar" as a defined anchor in the doc
if it is not found, I increase a $lookBacklevel param:
=> the 2nd level of recursion searches for "foo bar"
and so on... until it finds a matching definition, or throws an error if there is none.
=> when a definition is found, the text is output with a link:
<p>... my <link idref="#anchorFooBar">foo bar</link> ...</p>


To do so I (space-)tokenized the text:
<xsl:variable name="tokenText" select="tokenize($text,' ')" as="xs:string*"/>


and then made 2 strings depending on the recursion param $lookBacklevel:
<xsl:variable name="textBegin" select="string-join($tokenText[position() lt ($tokenNum - $lookBacklevel + 1)],' ')"/>
<xsl:variable name="textEnd" select="string-join($tokenText[position() ge ($tokenNum - $lookBacklevel + 1)],' ')"/>


I then search for a matching definition:
<xsl:variable name="matchingAncres" select="$ancres[normalize-space($textEnd)!=''][igs:match-ancre(.,$textEnd)]" as="element()*"/>
(matching rules are defined in a specific function)


The problem I've got is that the tokenize separator is too specific: it's only a space, and sometimes words are separated by other characters such as:
- non-breaking space "&#160;"
- opening parenthesis "("
- French quotes "«"
- ...


I could use a regex like "[\s(]«" as the 2nd arg of tokenize(), but I would then not be able to reconstruct the string.

So is there a way to get the separator that was matched by the regex of tokenize(),
just like regex-group() does when using <xsl:analyze-string>?


I think the answer is "no", but maybe I'm missing a trick to achieve this?

I could maybe use <xsl:analyze-string>, but this is not so easy because of the recursive template: the regex will depend on the $lookBacklevel param. I'm not sure I can find the right pattern...

Regards,

Matthieu.





--
Matthieu Ricaud
05 45 37 08 90
IGS-CP, service livres numériques
