Re: tokenize() and regex-group ?
Subject: Re: tokenize() and regex-group ?
From: Matthieu Ricaud-Dussarget <matthieu.ricaud@xxxxxxxxx>
Date: Tue, 17 Jul 2012 16:54:13 +0200
Thank you Michael.
At first I used <xsl:analyze-string>, before I realized I have to check
for one *or more* words to match the definition.
I thought using tokenize would help with going back recursively from
word to word into the string (with the help of position() predicates).
But I had not thought about the problems with different separator
patterns (the igs:match-ancre() function is permissive about this... but
the output tagging is not good, e.g. <link idref="#foobar">« foo
bar*</link> » should rather be « <link idref="#foobar">foo bar*</link> »).
As usual you're right :-) I have to go back to <xsl:analyze-string>.
The problem I suspected was about a regex which matches 1 or 2 or N words,
something like $wordRegex$sepRegex{{$lookBacklevel}}
After your emphasis on fn:tokenize(), I finally started on another way of
doing it, with the help of a function that tokenizes as XML:
<xsl:function name="igs:tokenize-as-xml" as="element()*">
  <xsl:param name="string" as="xs:string"/>
  <xsl:param name="regex" as="xs:string"/>
  <xsl:variable name="tmp" as="element()*">
    <xsl:analyze-string select="$string" regex="{$regex}">
      <xsl:matching-substring>
        <igs:sep><xsl:value-of select="."/></igs:sep>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <igs:text><xsl:value-of select="."/></igs:text>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:variable>
  <xsl:for-each-group select="$tmp"
                      group-adjacent="local-name(.)='sep'">
    <xsl:choose>
      <xsl:when test="current-grouping-key()">
        <igs:sep><xsl:value-of select="string-join(current-group(),'')"/></igs:sep>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy-of select="current-group()"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each-group>
</xsl:function>
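To illustrate how such a token sequence could be used for the tagging, here is a minimal sketch, not the final code. It assumes that a definition matching "foo bar" has already been found with $lookBacklevel = 2. The namespace URI http://example.org/igs, the sample string, the separator regex and the idref value are all placeholders chosen for the example. The function is repeated only so that the stylesheet is complete on its own.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:igs="http://example.org/igs"
    exclude-result-prefixes="xs igs">

  <!-- Same function as above; the igs namespace URI is a placeholder. -->
  <xsl:function name="igs:tokenize-as-xml" as="element()*">
    <xsl:param name="string" as="xs:string"/>
    <xsl:param name="regex" as="xs:string"/>
    <xsl:variable name="tmp" as="element()*">
      <xsl:analyze-string select="$string" regex="{$regex}">
        <xsl:matching-substring>
          <igs:sep><xsl:value-of select="."/></igs:sep>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <igs:text><xsl:value-of select="."/></igs:text>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <!-- Merge runs of adjacent separators into a single igs:sep. -->
    <xsl:for-each-group select="$tmp" group-adjacent="local-name(.)='sep'">
      <xsl:choose>
        <xsl:when test="current-grouping-key()">
          <igs:sep><xsl:value-of select="string-join(current-group(),'')"/></igs:sep>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="current-group()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:function>

  <!-- Entry point for the sketch (e.g. an initial named template). -->
  <xsl:template name="main">
    <!-- Sample input: my «foo bar» ; separators: whitespace, "(", « and » -->
    <xsl:variable name="tokens" as="element()*"
        select="igs:tokenize-as-xml('my &#xAB;foo bar&#xBB;', '[\s(&#xAB;&#xBB;]')"/>
    <!-- Assumed: the definition was matched on the last 2 words. -->
    <xsl:variable name="lookBacklevel" select="2" as="xs:integer"/>
    <!-- Positions, within $tokens, of the word tokens only. -->
    <xsl:variable name="wordPos" as="xs:integer*"
        select="for $i in 1 to count($tokens)
                return if ($tokens[$i]/self::igs:text) then $i else ()"/>
    <xsl:variable name="start" select="$wordPos[last() - $lookBacklevel + 1]"/>
    <xsl:variable name="end" select="$wordPos[last()]"/>
    <p>
      <!-- Everything before the first linked word, separators included. -->
      <xsl:value-of select="$tokens[position() lt $start]" separator=""/>
      <link idref="#foobar">
        <xsl:value-of select="$tokens[position() ge $start and position() le $end]"
                      separator=""/>
      </link>
      <!-- Trailing separators stay outside the link. -->
      <xsl:value-of select="$tokens[position() gt $end]" separator=""/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

With this input the tokens are text "my", sep " «", text "foo", sep " ", text "bar", sep "»", so the link wraps only "foo bar" and both quotes stay outside it. The sketch does not handle the case where $lookBacklevel is larger than the number of words.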
I hope I can tie everything together; I will let you know when I have finished.
Regards,
Matthieu.
On 17/07/2012 15:22, Michael Kay wrote:
You need to use xsl:analyze-string. I don't understand the
difficulties in using this inside a recursive template.
xsl:analyze-string can do everything that tokenize can do; you could
implement tokenize as
<xsl:function name="fn:tokenize" as="xs:string*">
  <xsl:param name="in" as="xs:string"/>
  <xsl:param name="regex" as="xs:string"/>
  <xsl:analyze-string select="$in" regex="{$regex}">
    <!-- separators are matched and discarded -->
    <xsl:matching-substring/>
    <xsl:non-matching-substring>
      <xsl:sequence select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:function>
Start by replacing your call to tokenize with a call to that function,
then add whatever functionality you need.
Michael Kay
Saxonica
On 17/07/2012 14:02, Matthieu Ricaud-Dussarget wrote:
Hi all,
I'm tokenizing some text within a recursive template. The goal is to
generate links to some "definitions" inside the doc.
Let's say my text is: "my foo bar"
=> the 1st level of recursion searches for "bar" as a defined anchor
in the doc
if it is not found, I increase a $lookBacklevel param:
=> the 2nd level of recursion searches for "foo bar"
and so on... until it finds a matching definition, or throws an error if
not.
=> when a definition is found, the text is output with a link :
<p>... my <link idref="#anchorFooBar">foo bar</link> ...</p>
To do so I (space-) tokenized the text :
<xsl:variable name="tokenText" select="tokenize($text,' ')"
as="xs:string*"/>
and then built 2 strings depending on the recursion param $lookBacklevel:
<xsl:variable name="textBegin"
select="string-join($tokenText[position() lt ($tokenNum -
$lookBacklevel + 1)],' ')"/>
<xsl:variable name="textEnd"
select="string-join($tokenText[position() ge ($tokenNum -
$lookBacklevel + 1)],' ')"/>
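As a worked example of this split, here is a minimal sketch for the text "my foo bar" at the 2nd level of recursion. It assumes that $tokenNum is simply the number of tokens, which is not stated above. The "main" template name and the bracketed text output are only there to make the result visible.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <xsl:variable name="text" select="'my foo bar'" as="xs:string"/>
    <xsl:variable name="tokenText" select="tokenize($text,' ')" as="xs:string*"/>
    <!-- Assumption: $tokenNum is the token count (3 here). -->
    <xsl:variable name="tokenNum" select="count($tokenText)" as="xs:integer"/>
    <!-- 2nd level of recursion: look back over the last 2 words. -->
    <xsl:variable name="lookBacklevel" select="2" as="xs:integer"/>
    <!-- Split point is 3 - 2 + 1 = 2: token 1 goes to the beginning,
         tokens 2 and 3 go to the end. -->
    <xsl:variable name="textBegin"
        select="string-join($tokenText[position() lt ($tokenNum - $lookBacklevel + 1)],' ')"/>
    <xsl:variable name="textEnd"
        select="string-join($tokenText[position() ge ($tokenNum - $lookBacklevel + 1)],' ')"/>
    <!-- Writes: [my] [foo bar] -->
    <xsl:value-of select="concat('[', $textBegin, '] [', $textEnd, ']')"/>
  </xsl:template>

</xsl:stylesheet>
```

Because both strings are rebuilt with a single space, any other separator that was matched by tokenize() would be lost at this step.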
I then search for a matching definition :
<xsl:variable name="matchingAncres"
select="$ancres[normalize-space($textEnd)!=''][igs:match-ancre(.,$textEnd)]"
as="element()*"/>
(matching rules are defined in a specific function)
The problem I've got is that the tokenize separator is too specific:
it's only a space, and sometimes words are separated by other characters like:
- non-breaking space " "
- opening parenthesis "("
- French quotes "«"
- ...
I could use a regex like "[\s(]«" as 2nd arg of tokenize() but I
would then not be able to reconstruct the string.
So is there a way to get the separator that has been matched by the
regex of tokenize()?
Just like regex-group() does when using <xsl:analyze-string>?
I think the answer is "no", but maybe I'm missing a trick to achieve
this?
I could maybe use <xsl:analyze-string>, but this is not so easy
because of the recursive template: the regex will depend on the
$lookBacklevel param. I'm not sure I can find the right pattern...
Regards,
Matthieu.
--
Matthieu Ricaud
05 45 37 08 90
IGS-CP, service livres numériques