# Re: Regular expression functions (Was: Re: comments on

 Subject: Re: Regular expression functions (Was: Re: comments on December F&O draft) From: Jeni Tennison Date: Sat, 12 Jan 2002 16:07:15 +0000
Hi Marc,

> thinking about doing it... the reality of the regex engine is that
> while you would like the tree to appear somewhere in memory to
> select on, the way the regex engine will offer you the matchresult
> will be more in an array like format.

Yes, but when you talk about "the regexp engine" you're talking about
a specific implementation with a specific API. I don't see any reason
why a different implementation with a different API shouldn't offer
you a different view.

And I'd argue that we're going to need regular expression engines
specific to XSLT anyway (for support in XSLT, I know that your
concerns aren't necessarily the same) - regular expression engines for
XML Schema regular expressions aren't going to cover everything we
need for XSLT (e.g. start-string and end-string meta characters), and
other regexp engines won't support the XML Schema stuff.

But even having said that, perhaps you can get around it with some
jiggery-pokery. Say you had $number defined as the regular expression: ([0-9]+)(\.([0-9]+))? and you had a regular expression of: [+-]?($number)([Ee]([+-][0-9]+))?

Matching against:

-12.34E+5

If you simply substitute in the $number expression to get a result expression: [+-]?(([0-9]+)(\.([0-9]+))?)([Ee]([+-][0-9]+))? Then run the match on that. The result of the match could come in pairs of start/end integers, giving the position of the start and end of the match as a whole and each subexpression, something akin to (assuming counting starts from 0): ( (0, 9), (1, 6), (1, 3), (4, 6), (6, 9), (7, 9), (8, 9) ) You can tell whether one subexpression is 'inside' another subexpression by the fact that its start position is equal to or greater than the start position of the parent position, and its end position is equal to or less than the start position of the parent position. So from that you should be able to construct the tree: <rxp:match> - <rxp:match> <rxp:match>12</rxp:match>.<rxp:match>34</rxp:match> </rxp:match> <rxp:match>E<rxp:match>+5</rxp:match></rxp:match> </rxp:match> You can work out which of these rxp:match elements should actually be a 'number' element by counting the number of open-brace characters before the variable reference in the regular expression. In the case of: [+-]?($number)([Ee]([+-][0-9]+))?

it's the first one, so the first rxp:match element (inside the global
match) should be a number element instead:

<rxp:match>
-
<number>
<rxp:match>12</rxp:match>.<rxp:match>34</rxp:match>
</number>
<rxp:match>E<rxp:match>+5</rxp:match></rxp:match>
</rxp:match>

I'll note that the *big* drawback with this algorithm is that it
doesn't take into account nested subexpressions. Say you had something
like:

$para => \\para\{($phrase)\}
$phrase => (($bold)|($italic)|([^\\]*))*$bold   =>  \\bold\{($phrase)\}$italic =>  \\italic\{($phrase)\} to match a string like: \para{\italic{this} is \bold{bold \italic{and italic}} text.} This is essentially David's problem. You can't just expand all the variable subexpressions because they'd go on for ever. But without expanding all the variable subexpressions, you can't successfully backtrack if you suddenly find that a variable subexpression isn't in fact matched. For example, if you had:$baz     =>  \\baz\{(($barfoo)|($barbar))\}
$barbar => \\bar\{($bar)\}
$barfoo => \\bar\{($foo)\}
$foo => foo And a string like: \baz{\bar{\bar{foo}} We want to get: <baz> \baz{ <barbar>\bar{<barfoo>\bar{<foo>foo</foo>}</barfoo>}</barbar> } </baz> We can start off by matching the string against$baz and working out
that to match it, then the regular expression:

(($barfoo)|($barbar))\}

must match the string:

\bar{\bar{foo}}}

and that means that the following regular expression must match the
string:

((\\bar\{($foo)\})|(\\bar\{($bar)\}))\}

but there's no way to tell whether the \bar at the beginning of the
string should match the \\bar at the beginning of $barfoo or of$barbar.

That might be an acceptable limitation, or it might be something that
can be worked around, I'm not sure. I need to think about it some
more.

> the feeling I currently have is that this nested-regex to tree
> vision directly has the power of circumventing the need for the
> nested matcher structure alltogether... meaning with this kind of
> tree view on the regex-match-result one can hand it over to the
> regular-and-known xsl-node-juggling, I guess.

Quite - that's what I think makes it so powerful. If you get a tree
like the above, you could apply templates to it and so on. You have
one big regular expression that gets directly translated into a tree,
which you can then process using XSLT as usual.

>> I should note that nothing existing in XPath or XSLT automatically
>> creates a tree in this way.
>
> And this would be an argument to keep the nested-matcher approach.
> They behave more like the known templates that do create trees I
> guess?

Hey, why limit yourself to only doing things in the way that they're
done now?!? :)

But I think I'm still not very clear on your nested matcher approach.
Do you think you could use the examples above to illustrate how your
nested matcher would work?

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/

XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list



### PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!