Subject: Re: Regular expression functions (Was: Re: comments on December F&O draft)
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sun, 13 Jan 2002 10:28:23 +0000
I wrote:
> I'll think some more...

And of course had an idea immediately I went to bed, and therefore
couldn't sleep...

In XSLT, you *select* a bunch of nodes to process, the processor goes
through them one by one, and you have templates that *match* those
nodes and provide whatever output you want for them. This has proved a
very flexible way of going about things, especially in cases where you
have deeply nested, unpredictable structures.

So to deal with strings that have deeply nested, unpredictable
structures (such as David's example), perhaps that same kind of
approach would work. You need a way of selecting a sequence of strings
and applying templates to them, where the templates have regular
expression patterns. Something along the lines of:

  <!-- Category: instruction -->
  <xsl:apply-regexp-templates
    select = string-sequence-expression
    mode   = qname>
    <!-- Content: (xsl:sort | xsl:with-param)* -->
  </xsl:apply-regexp-templates>

You also need something for declaring regular expression templates
that match those strings. Something along the lines of:

  <!-- Category: declaration -->
  <xsl:regexp-template
    match    = regular-expression
    priority = number
    mode     = qname>
    <!-- Content: (xsl:param*, content-constructor) -->
  </xsl:regexp-template>

When you use xsl:apply-regexp-templates, the processor goes through
the string sequence one string at a time in the sorted order (or
original order) and tries to find a template that matches the entire
string. It finds the highest-priority template that matches the entire
string (note that there are no implied priorities, so you have to use
the priority attribute if a string might match more than one
template), and uses that to create content. Modes and parameters work
in the usual way.

Within the xsl:regexp-template element, the context item is the string
that's matched by the template; the context position is its position
within the (sorted) string sequence to which regexp templates were
applied; the context size is the length of that sequence.

The evaluation context includes a current match, which is a sequence
of strings - the subexpressions from the match regular expression. You
can retrieve this sequence using the current-match() function.

[Or something along those lines - there are lots of possibilities
 for how you get hold of that information.]
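
The dispatch rules described above can be simulated outside XSLT. The following is only a rough Python sketch of the proposed semantics, not an implementation of any existing processor: the names regexp_template, apply_regexp_templates and TEMPLATES are made up for illustration, strings that match no template are silently skipped, and ties in priority go to the template registered first (the proposal does not say what should happen in either case).

```python
import re

# Hypothetical registry: mode -> list of (priority, compiled pattern, handler).
TEMPLATES = {}

def regexp_template(match, mode=None, priority=0.0):
    """Register a handler, playing the role of xsl:regexp-template."""
    def register(handler):
        TEMPLATES.setdefault(mode, []).append(
            (priority, re.compile(match), handler))
        return handler
    return register

def apply_regexp_templates(strings, mode=None):
    """Play the role of xsl:apply-regexp-templates for a string sequence."""
    strings = list(strings)
    results = []
    for position, item in enumerate(strings, start=1):
        best = None
        for priority, pattern, handler in TEMPLATES.get(mode, []):
            m = pattern.fullmatch(item)  # the template must match the entire string
            if m and (best is None or priority > best[0]):
                best = (priority, m, handler)
        if best is not None:  # assumption: unmatched strings produce nothing
            _, m, handler = best
            # context item, context position, context size, current match
            results.append(handler(item, position, len(strings), list(m.groups())))
    return results

@regexp_template(r'[0-9]+', mode='kind', priority=1)
def number(item, position, size, current_match):
    return 'number'

@regexp_template(r'.*', mode='kind', priority=0)
def anything(item, position, size, current_match):
    return 'text'

print(apply_regexp_templates(['42', 'x1'], mode='kind'))
```

Here '42' matches both templates, so the explicit priority decides; 'x1' only matches the catch-all.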

Taking a simple example:

  <xsl:apply-regexp-templates select="'13/1/02'" mode="date" />

The processor applies templates to the date; there are multiple
templates in date mode (for different date formats), but the one that
matches with the highest priority is:

<xsl:regexp-template match="([0-9]{1,2})/([0-9]{1,2})/([0-9]{2})"
                     mode="date">
  <xsl:variable name="day"
                select="format-number(current-match()[1], '00')" />
  <xsl:variable name="month"
                select="format-number(current-match()[2], '00')" />
  <xsl:variable name="year"
                select="if (current-match()[3] > 30)
                        then (current-match()[3] + 1900)
                        else (current-match()[3] + 2000)" />
  <xsl:value-of select="($year, $month, $day)"
                separator="-" />
</xsl:regexp-template>
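
For comparison, here is roughly the same date conversion written as a Python sketch. The function name to_iso is an assumption for illustration; the pattern and the pivot year of 30 are taken from the template above, and the whole string has to match, as it would for a regexp template.

```python
import re

# Same pattern as the template: day/month/two-digit year.
DATE = re.compile(r'([0-9]{1,2})/([0-9]{1,2})/([0-9]{2})')

def to_iso(date):
    """Return YYYY-MM-DD for a d/m/yy string, or None if it does not match."""
    m = DATE.fullmatch(date)  # the entire string must match
    if m is None:
        return None
    day, month, yy = (int(g) for g in m.groups())
    # Two-digit years above 30 are read as 19xx, the rest as 20xx.
    year = yy + 1900 if yy > 30 else yy + 2000
    return f'{year}-{month:02d}-{day:02d}'

print(to_iso('13/1/02'))
```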

To supplement the template pattern, there should be an instruction
that merges the xsl:apply-regexp-templates and the
xsl:regexp-template elements:

  <!-- Category: instruction -->
  <xsl:match
    select = string-expression
    regexp = regular-expression>
    <!-- Content: (xsl:sort*, content-constructor) -->
  </xsl:match>

For simple cases like the above, this allows you to just do:

  <xsl:match select="'13/1/02'"
             regexp="([0-9]{1,2})/([0-9]{1,2})/([0-9]{2})">
    <xsl:variable name="day"
                  select="format-number(current-match()[1], '00')" />
    <xsl:variable name="month"
                  select="format-number(current-match()[2], '00')" />
    <xsl:variable name="year"
                  select="if (current-match()[3] > 30)
                          then (current-match()[3] + 1900)
                          else (current-match()[3] + 2000)" />
    <xsl:value-of select="($year, $month, $day)"
                  separator="-" />
  </xsl:match>

To make it easier to construct string sequences to which to apply
regular expression templates, I suggest a function (or two, perhaps,
given the general avoidance of function overloading) that basically
tokenises a string based on a regular expression. The signatures of
the two forms would be:

  tokenize(string $string, string $regexp) => string*
  tokenize(string $string, string $start-regexp, string $end-regexp)
    => string*

The first form splits $string into a sequence of strings. Every
even-positioned string in the result matches the $regexp; the
odd-positioned strings are the text in between. For example:

  tokenize(' foo  bar   baz', '\s+')
    => ('', ' ', 'foo', '  ', 'bar', '   ', 'baz')
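
The first form is easy to mimic with Python's re module. This is a minimal sketch of the behaviour described above, not the proposed XPath function itself; it assumes the regular expression never matches the empty string and simply skips such matches.

```python
import re

def tokenize(string, regexp):
    """Split string so that even-positioned pieces (2nd, 4th, ...) match regexp."""
    pieces = []
    last = 0
    for m in re.finditer(regexp, string):
        if m.end() == m.start():
            continue  # assumption: ignore empty matches
        pieces.append(string[last:m.start()])  # text before the match (may be '')
        pieces.append(m.group(0))              # the matching separator
        last = m.end()
    pieces.append(string[last:])               # trailing text
    return pieces

print(tokenize(' foo  bar   baz', r'\s+'))
```

The result always has an odd number of items, so the position of a piece tells you whether it matched.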

The second form does a similar thing, except that the even-positioned
strings must begin with a match for the $start-regexp and end with a
match for the $end-regexp. What's more, each even-positioned string in
the result must be balanced - it must contain as many substrings
matching the $start-regexp as substrings matching the $end-regexp
(with no overlapping). For example:

  tokenize('this is \bold{bold \italic{and italic}} text',
           '\\[a-z]+\{', '\}')
    => ('this is ', '\bold{bold \italic{and italic}}', ' text')
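
The balancing rule can be sketched in Python with a simple depth counter. This is an illustration under several assumptions rather than a definitive algorithm: the name tokenize_balanced is made up, neither pattern may match the empty string, an unbalanced remainder is left untouched in the last piece, and back references from the end pattern into the start pattern are not supported.

```python
import re

def _search(pattern, string, pos):
    """Search from pos, rejecting empty matches (they would never advance)."""
    m = pattern.search(string, pos)
    if m is not None and m.end() == m.start():
        raise ValueError('patterns must not match the empty string')
    return m

def tokenize_balanced(string, start_regexp, end_regexp):
    """Even-positioned pieces are balanced start...end spans."""
    start_re = re.compile(start_regexp)
    end_re = re.compile(end_regexp)
    pieces = []
    last = 0  # where the current odd-positioned piece begins
    while True:
        opener = _search(start_re, string, last)
        if opener is None:
            break
        depth, scan = 1, opener.end()
        while depth > 0:
            s = _search(start_re, string, scan)
            e = _search(end_re, string, scan)
            if e is None:
                break  # nothing left to close the span
            if s is not None and s.start() < e.start():
                depth, scan = depth + 1, s.end()  # nested opener comes first
            else:
                depth, scan = depth - 1, e.end()  # closer comes first
        if depth > 0:
            break  # assumption: keep the unbalanced rest in the last piece
        pieces.append(string[last:opener.start()])
        pieces.append(string[opener.start():scan])
        last = scan
    pieces.append(string[last:])
    return pieces

print(tokenize_balanced(r'this is \bold{bold \italic{and italic}} text',
                        r'\\[a-z]+\{', r'\}'))
```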

Note that any odd string in the result may contain a substring that
matches the $end-regexp; similarly, the last string in the result may
start with a match for the $start-regexp, if there's no matching
$end-regexp. Also, in some strings the substring matching the
$start-regexp may overlap with the substring matching the $end-regexp.
To make it easier to manage formats like messy HTML, where you need
the $end-regexp to contain something from the $start-regexp,
$end-regexp can contain back references to subexpressions within
$start-regexp, in the form \1...\N. For example (not escaping <s for
clarity):

  tokenize('this <img src="glyph.gif"> is <b>bold</b> text',
           '<([a-z]+)>', '</\1>')
    => ('this ', '<img src="glyph.gif"> is <b>bold</b> text')

The fact that the tokenize() function takes regular expression strings
means that it's possible to construct regular expressions on the fly.
The fact that you *can't* construct regular expressions with the other
regular expression constructs (they don't have attribute value
templates), means that they can be parsed when the processor first
reads the stylesheet rather than at runtime, which is good for
efficiency, I think, especially considering how many regular
expression templates you might have.

I think that the regular expressions in tokenize() give you all you
actually need. For example, to go through a piece of text and add an
em element around every occurrence of $keyword (as a whole word) in
the text, you could use:

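  <xsl:for-each
    select="tokenize($text, concat('\W+', $keyword, '\W+'))">
    <xsl:choose>
      <xsl:when test="position() mod 2 = 1">
        <xsl:value-of select="." />
      </xsl:when>
      <xsl:otherwise>
        <xsl:for-each select="tokenize(., '\W+')">
          <xsl:choose>
            <xsl:when test="position() mod 2 = 0">
              <xsl:value-of select="." />
            </xsl:when>
            <xsl:otherwise>
              <em>
                <xsl:value-of select="." />
              </em>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:for-each>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each>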

But if you have a static regular expression (and you don't have to
worry about bracket balancing) it's simpler to use xsl:match or
xsl:apply-regexp-templates instead:

  <xsl:for-each
    select="tokenize($text, concat('\W+', $keyword, '\W+'))">
    <xsl:choose>
      <xsl:when test="position() mod 2 = 1">
        <xsl:value-of select="." />
      </xsl:when>
      <xsl:otherwise>
        <xsl:match select="." regexp="(\W+)(.*)(\W+)">
          <xsl:value-of select="current-match()[1]" />
          <em>
            <xsl:value-of select="current-match()[2]" />
          </em>
          <xsl:value-of select="current-match()[3]" />
        </xsl:match>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each>
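
As a cross-check of the keyword example, here is a small Python sketch that produces the same kind of output as a string of markup. The function name emphasise is made up for illustration, and the keyword is treated as literal text (via re.escape), which is an assumption. Like the stylesheet version, it only finds the keyword when it has non-word characters on both sides, so an occurrence at the very start or end of the text is left alone.

```python
import re
from html import escape

def emphasise(text, keyword):
    """Wrap whole-word occurrences of keyword in <em>, escaping the rest."""
    # Three groups, as in the xsl:match version: separator, keyword, separator.
    pattern = r'(\W+)(' + re.escape(keyword) + r')(\W+)'
    out = []
    last = 0
    for m in re.finditer(pattern, text):
        out.append(escape(text[last:m.start()]))  # text between occurrences
        out.append(escape(m.group(1)))
        out.append('<em>' + escape(m.group(2)) + '</em>')
        out.append(escape(m.group(3)))
        last = m.end()
    out.append(escape(text[last:]))
    return ''.join(out)

print(emphasise('the cat sat on the cat mat', 'cat'))
```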

I think a lot of this could be refined, but that as a general approach
it might be feasible. Any thoughts?


Jeni Tennison

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
