Subject: Re: Regular expression functions (Was: Re: comments on December F&O draft)
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sat, 12 Jan 2002 21:57:07 +0000

David,

>>   \para{\italic{this} is \bold{bold \italic{and italic}} text.}
>
> Ohh looks just like TeX, we'll get you using that yet...

I only did it 'cos I knew you'd like it ;)

> I can think of two ways of attacking the above with regexp.
>
> * Plan A (which is the way I'd do it in emacs) is to 
>
> have a regexp replace
>
> \(\\[a-z]*\){\([^{}*]\)}  to <\1>\2</\1>
>
[snip]
> That's fine but requires that either you consider the XML markup
> just to be part of the string (which is what I did here but what we
> want to avoid in XSLT) or that your regexps can match across mixed
> content models ie instead of [^{}]* meaning any character other than
> a brace you'd need something that says any character-or-node other
> than a brace.

Yeah, I think that's ugly. On the other hand, if it's easy to express,
and easy to implement, then why not - the fact that it operates over
the string is hidden by the implementation - we shouldn't have to
worry about it (but perhaps it would be difficult for the
implementation if the regular expression included < characters, like
it would if it was parsing HTML...).
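
For reference, here is a minimal sketch of Plan A treated purely as string rewriting, i.e. with the markup considered part of the string. It is written in Python rather than XSLT, and the function name plan_a is hypothetical. It uses [^{}]* for the content and repeats the replace until nothing changes, so the innermost groups are converted first:

```python
import re

# Innermost group: \name{ ... } whose content contains no braces.
INNERMOST = re.compile(r"\\([a-z]+)\{([^{}]*)\}")

def plan_a(text):
    """Rewrite \\name{...} groups as <name>...</name>, innermost first."""
    while True:
        rewritten = INNERMOST.sub(r"<\1>\2</\1>", text)
        if rewritten == text:
            return text  # nothing left to replace
        text = rewritten

source = r"\para{\italic{this} is \bold{bold \italic{and italic}} text.}"
print(plan_a(source))
```

This only works because the result stays a flat string throughout; it says nothing about matching across element nodes.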

We'd also need some way of expressing it in XSLT. I've been imagining
that the tree generation goes on behind the scenes, so you don't get
much control over what the tree structure looks like. That means you'd
get nested rxp:match elements rather than bold and italic elements,
but at least you get something that you can then do more work with.

With what you have above, perhaps something like:

<xsl:regexp-template match="\(\\[a-z]*\){\([^{}*]\)}">
  <xsl:element name="{current-match(1)}">
    <xsl:value-of select="current-match(2)" />
  </xsl:element>
</xsl:regexp-template>

with:

  <xsl:apply-regexp-templates
    select="'\para{\italic{this} is \bold{bold \italic{and italic}} text.}'" />

The processor finds the regexp template that matches the longest
substring from the string you apply regexp templates to.

The substring is replaced by whatever you put in the content of the
regexp template to create a sequence (of strings and nodes).

The processor then tries to match on the concatenation of the string
values of the items in this new sequence.

It finds the template that matches the longest substring that is
either a substring of a text node descendant of an element in the
original sequence or wholly contains an element. (In other words, no
overlap with element content.)

Then you need to have another function that returns the matched string
as a sequence of elements and text...

... but trying to work through this, I think this approach is doomed.
Trying to deal with both the concatenated string value and the
elements at the same time is just not worth the aggravation.

> plan 2':
> I suspect that one way to attack this in xslt2 is just to have two
> simple regexp replaces
>
> \\\([a-z]*\){  -> <start name="\1"/>
>
> }              -> <end/>
>
> so after doing the regexp matching I'd have:
>
> <start name="para"/><start name="italic"/>this<end/> is <start
> name="bold"/>bold <start name="italic"/>and italic<end/><end/>
> text.<end/>
>
> so now we've got rid of that flat string and replaced it by
> something that's still flat but is mixed content with empty element
> nodes and text.
>
> Getting from that flat mixed content to a hierarchical element tree
> is just the famous xslt grouping problem which a typical Gumbie Cat
> ought to be able to do in her sleep, especially if given the xslt2
> grouping constructs.

Yes, good idea. I think that for this you just need a global replace
on the entire string. The trouble, of course, is specifying it. You
could do:

  <xsl:apply-regexp-templates
    select="'\para{\italic{this} is \bold{bold \italic{and italic}} text.}'" />

and have:
    
<xsl:regexp-template match="(\\([a-z]*))\{">
  <xsl:element name="{current-match(2)}" />
</xsl:regexp-template>

<xsl:regexp-template match="\}">
  <end />
</xsl:regexp-template>

And the xsl:apply-regexp-templates returns a sequence consisting of
whatever you get from the original string, with any matches
substituted in. I'd say that any substring that got matched should be
completely handled by the template that matches it - you use
xsl:apply-regexp-templates to do further processing on a submatch in
the string.
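
The second half of plan 2', getting from the flat sequence to a tree, can be sketched with a simple stack. This is an illustration in Python, not XSLT grouping: the ("start", name) and ("end",) tuples are hypothetical stand-ins for the marker elements, and the wrapper element named root is an assumption made only to have somewhere to hang the result.

```python
import xml.etree.ElementTree as ET

def nest(flat):
    """Turn a flat list of text and start/end markers into an element tree."""
    root = ET.Element("root")  # hypothetical wrapper element
    stack = [root]
    last_closed = None  # most recently closed child of the open element
    for item in flat:
        if isinstance(item, str):
            if last_closed is None:
                stack[-1].text = (stack[-1].text or "") + item
            else:
                last_closed.tail = (last_closed.tail or "") + item
        elif item[0] == "start":
            stack.append(ET.SubElement(stack[-1], item[1]))
            last_closed = None
        else:  # an end marker closes the innermost open element
            if len(stack) == 1:
                raise ValueError("end marker without a matching start")
            last_closed = stack.pop()
    return root

flat = [("start", "para"), ("start", "italic"), "this", ("end",), " is ",
        ("start", "bold"), "bold ", ("start", "italic"), "and italic",
        ("end",), ("end",), " text.", ("end",)]
tree = nest(flat)
print(ET.tostring(tree[0], encoding="unicode"))
```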

You could characterise it with the algorithm:

 - take the match string
 - locate the regexp template that matches the longest substring
   within it
 - split the string into three:
   a. the string before the matched substring
   b. the matched substring
   c. the string after the matched substring
 - the result is a sequence containing the result of continuing to
   apply regexp templates to (a), followed by the result of the
   template that matches (b), followed by the result of continuing to
   apply regexp templates to (c).
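
The steps above can be sketched in Python as a rough illustration. The function name, the (pattern, handler) pairs and the tuples standing in for result nodes are all hypothetical. One stated assumption: "longest" is taken over the non-overlapping matches that re.finditer reports for each pattern, with ties going to the earliest match.

```python
import re

def apply_regexp_templates(text, templates):
    """Return unmatched strings and template results, in document order."""
    if not text:
        return []
    best = None  # (match, handler) for the longest match seen so far
    for pattern, handler in templates:
        for m in re.finditer(pattern, text):
            length = m.end() - m.start()
            if length > 0 and (best is None or length > len(best[0].group(0))):
                best = (m, handler)
    if best is None:
        return [text]  # no template matches: the string passes through
    m, handler = best
    before, after = text[:m.start()], text[m.end():]
    return (apply_regexp_templates(before, templates)   # (a)
            + [handler(m)]                              # (b)
            + apply_regexp_templates(after, templates)) # (c)

templates = [
    (r"\\([a-z]*)\{", lambda m: ("start", m.group(1))),
    (r"\}", lambda m: ("end",)),
]
source = r"\para{\italic{this} is \bold{bold \italic{and italic}} text.}"
result = apply_regexp_templates(source, templates)
print(result)
```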

You should have modes on the regexp-templates, but I don't think
there's any need for priorities (aside from the old import precedence
thing - nothing from an imported stylesheet should take precedence
over the importing stylesheet).

The only trouble is that this doesn't enable matches of regular
expressions that are generated on the fly (or at least not from local
variables).

I think you need an instruction - xsl:regexp-for-each, say - that does
something pretty simple: it iterates over alternating unmatched and
matched substrings within a string. Odd-positioned strings are those
that were unmatched, even-positioned strings are those that were
matched. Again, the current-match() function returns information about
the subexpression matches for the matched strings.

So the equivalent of the above templates would be something like:

  <xsl:regexp-for-each
    select="'\para{\italic{this} is \bold{bold \italic{and italic}} text.}'"
    regexp="((\\([a-z]*))\{{)|(\}})">

    <xsl:choose>
      <xsl:when test="position() mod 2 = 1">
        <xsl:value-of select="." />
      </xsl:when>
      <xsl:when test=". = '}'">
        <end />
      </xsl:when>
      <xsl:otherwise>
        <xsl:element name="{current-match(3)}" />
      </xsl:otherwise>
    </xsl:choose>
    
  </xsl:regexp-for-each>

The regexp attribute would be an attribute value template (although I
have to say that mixing attribute value templates with regular
expressions is pretty messy because {}s are used quite a lot within
them, so perhaps an expression would be cleaner).
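
The alternating iteration can be sketched in Python as a rough model of the proposed instruction. The generator name regexp_for_each and the tuples used for the markers are hypothetical. There are no attribute value templates here, so the braces are not doubled, and with one fewer pair of parentheses the element name is group 2 rather than 3. The sketch assumes the pattern never matches the empty string.

```python
import re

def regexp_for_each(text, pattern):
    """Yield (position, substring, match); odd positions are unmatched text."""
    position, last = 1, 0
    for m in re.finditer(pattern, text):
        yield position, text[last:m.start()], None  # unmatched, may be empty
        yield position + 1, m.group(0), m           # matched
        position += 2
        last = m.end()
    yield position, text[last:], None               # trailing unmatched text

source = r"\para{\italic{this} is \bold{bold \italic{and italic}} text.}"
flat = []
for position, substring, match in regexp_for_each(source, r"(\\([a-z]*)\{)|(\})"):
    if position % 2 == 1:
        if substring:  # drop the empty strings between adjacent matches
            flat.append(substring)
    elif substring == "}":
        flat.append(("end",))
    else:
        flat.append(("start", match.group(2)))
print(flat)
```

Because the pattern is an ordinary value here, it could just as well be built at run time, which is the point of having the instruction alongside the templates.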
    
Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
