
Re: tokenize() and regex-group ?

Subject: Re: tokenize() and regex-group ?
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Tue, 17 Jul 2012 14:22:21 +0100
You need to use xsl:analyze-string. I don't understand the difficulties in using this inside a recursive template. xsl:analyze-string can do everything that tokenize can do; you could implement tokenize as

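<!-- In a real stylesheet, declare this under your own namespace prefix:
     user functions cannot be declared in the reserved fn namespace. -->
<xsl:function name="fn:tokenize" as="xs:string*">
  <xsl:param name="in" as="xs:string"/>
  <xsl:param name="regex" as="xs:string"/>
  <xsl:analyze-string select="$in" regex="{$regex}">
    <!-- separators: emit nothing -->
    <xsl:matching-substring/>
    <!-- text between separators: emit each piece as a token -->
    <xsl:non-matching-substring>
       <xsl:sequence select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:function>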

Start by replacing your call to tokenize with a call to that function, then add whatever functionality you need.
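
As one possible way to "add whatever functionality you need", the sketch below keeps the separators instead of discarding them. It is not part of the original reply: the function name my:split-keeping-separators, the my namespace URI, the <tok> and <sep> element names, and the sample string and regex are all assumptions made for illustration. The named template main is only there so the stylesheet can be run on its own (for example with an initial-template option, if your processor has one).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my-functions"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: like tokenize(), but separators are kept.
       Returns <tok> and <sep> elements in the order they occur in $in. -->
  <xsl:function name="my:split-keeping-separators" as="element()*">
    <xsl:param name="in" as="xs:string"/>
    <xsl:param name="regex" as="xs:string"/>
    <xsl:analyze-string select="$in" regex="{$regex}">
      <xsl:matching-substring>
        <sep><xsl:value-of select="."/></sep>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <tok><xsl:value-of select="."/></tok>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <xsl:template name="main">
    <!-- Sample input and separator class are assumptions. -->
    <xsl:variable name="parts"
        select="my:split-keeping-separators('my (foo&#160;bar', '[\s(&#160;]+')"/>
    <!-- Joining all pieces with an empty separator rebuilds the input. -->
    <xsl:value-of select="string-join($parts, '')"/>
  </xsl:template>

</xsl:stylesheet>
```

Because every character of the input ends up in either a <tok> or a <sep>, the original string can always be reconstructed, which plain tokenize() does not allow once the separator is a regex rather than a fixed string.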

Michael Kay
Saxonica

On 17/07/2012 14:02, Matthieu Ricaud-Dussarget wrote:
Hi all,

I'm tokenizing some text within a recursive template. The goal is to generate links to some "definitions" inside the doc.
Let's say my text is: "my foo bar"
=> the 1st level of recursion searches for "bar" as a defined anchor in the doc
if it is not found, I increase a $lookBacklevel param:
=> the 2nd level of recursion searches for "foo bar"
and so on, until it finds a matching definition, or throws an error if there is none.
=> when a definition is found, the text is output with a link:
<p>... my <link idref="#anchorFooBar">foo bar</link> ...</p>


To do so, I tokenized the text on spaces:
<xsl:variable name="tokenText" select="tokenize($text,' ')" as="xs:string*"/>


and then built 2 strings depending on the recursion param $lookBacklevel:
<xsl:variable name="textBegin" select="string-join($tokenText[position() lt ($tokenNum - $lookBacklevel + 1)],' ')"/>
<xsl:variable name="textEnd" select="string-join($tokenText[position() ge ($tokenNum - $lookBacklevel + 1)],' ')"/>
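
To make the split concrete, here is a small self-contained sketch that runs the two expressions above for each look-back level on "my foo bar". It assumes that $tokenNum is the number of tokens and that $lookBacklevel starts at 1; neither is stated in the message. The named template main and the bracketed output format are only for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <xsl:variable name="text" select="'my foo bar'"/>
    <xsl:variable name="tokenText" select="tokenize($text, ' ')" as="xs:string*"/>
    <!-- Assumption: $tokenNum is the total number of tokens. -->
    <xsl:variable name="tokenNum" select="count($tokenText)"/>

    <!-- Assumption: $lookBacklevel runs from 1 up to the number of tokens. -->
    <xsl:for-each select="1 to $tokenNum">
      <xsl:variable name="lookBacklevel" select="."/>
      <xsl:variable name="textBegin"
          select="string-join($tokenText[position() lt ($tokenNum - $lookBacklevel + 1)], ' ')"/>
      <xsl:variable name="textEnd"
          select="string-join($tokenText[position() ge ($tokenNum - $lookBacklevel + 1)], ' ')"/>
      <!-- One line per level: level, [textBegin], [textEnd]. -->
      <xsl:value-of
          select="concat($lookBacklevel, ': [', $textBegin, '] [', $textEnd, ']&#10;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Under those assumptions, level 1 gives textBegin "my foo" and textEnd "bar", level 2 gives "my" and "foo bar", and level 3 gives an empty textBegin and "my foo bar".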


I then search for a matching definition :
<xsl:variable name="matchingAncres" select="$ancres[normalize-space($textEnd)!=''][igs:match-ancre(.,$textEnd)]" as="element()*"/>
(matching rules are defined in a specific function)


The problem is that the tokenize separator is too specific: it's only a space, and sometimes words are separated by other characters, like:
- non-breaking space "&#160;"
- opening parenthesis "("
- French quotes "«"
- ...


I could use a regex like "[\s(«]" as the 2nd arg of tokenize(), but then I would not be able to reconstruct the string.

So is there a way to get the separator that was matched by the regex of tokenize(),
just like regex-group() does when using <xsl:analyze-string>?


I think the answer is "no", but maybe I'm missing a trick to achieve this ?

I could maybe use <xsl:analyze-string>, but this is not so easy because of the recursive template: the regex would depend on the $lookBacklevel param. I'm not sure I can find the right pattern...
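
One way around that concern, sketched here under assumptions, is to keep the regex fixed and let only the selection depend on $lookBacklevel. The sketch starts from a hypothetical, hand-written result of a separator-preserving split of "my (foo&#160;bar" (the <tok> and <sep> element names are invented for illustration) and cuts it at the $lookBacklevel-th token from the end.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <!-- Hypothetical split of "my (foo&#160;bar": tokens and separators, in order. -->
    <xsl:variable name="parts" as="element()*">
      <tok>my</tok><sep> (</sep><tok>foo</tok><sep>&#160;</sep><tok>bar</tok>
    </xsl:variable>
    <xsl:variable name="lookBacklevel" select="2"/>

    <!-- The token where textEnd starts: the $lookBacklevel-th token from the end. -->
    <xsl:variable name="toks" select="$parts[self::tok]"/>
    <xsl:variable name="first" select="$toks[last() - $lookBacklevel + 1]"/>

    <!-- Position of that token within the full token/separator sequence. -->
    <xsl:variable name="start"
        select="for $i in 1 to count($parts)
                return if ($parts[$i] is $first) then $i else ()"/>

    <!-- Separators are part of the sequence, so both halves keep them. -->
    <xsl:variable name="textBegin" select="string-join($parts[position() lt $start], '')"/>
    <xsl:variable name="textEnd" select="string-join($parts[position() ge $start], '')"/>

    <xsl:value-of select="concat('[', $textBegin, '] [', $textEnd, ']')"/>
  </xsl:template>

</xsl:stylesheet>
```

With $lookBacklevel set to 2, textBegin is "my (" and textEnd is "foo&#160;bar", and concatenating the two gives back the original string. The pattern used for splitting never changes between recursion levels; only the cut position does.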

Regards,

Matthieu.
