Re: Regular expression functions (Was: Re: comments on
David,

>> \para{\italic{this} is \bold{bold \italic{and italic}} text.}
>
> Ohh looks just like TeX, we'll get you using that yet...

I only did it 'cos I knew you'd like it ;)

> I can think of two ways of attacking the above with regexp.
>
> * Plan A (which is the way I'd do it in emacs) is to
>   have a regexp replace
>
>   \(\\[a-z]*\){\([^{}*]\)} to <\1>\2</\1>
>
[snip]
> That's fine but requires that either you consider the XML markup
> just to be part of the string (which is what I did here but what we
> want to avoid in XSLT) or that your regexps can match across mixed
> content models ie instead of [^{}]* meaning any character other than
> a brace you'd need something that says any character-or-node other
> than a brace.

Yeah, I think that's ugly. On the other hand, if it's easy to express,
and easy to implement, then why not - the fact that it operates over
the string is hidden by the implementation - we shouldn't have to
worry about it (but perhaps it would be difficult for the
implementation if the regular expression included < characters, like
it would if it was parsing HTML...).

We'd also need some way of expressing it in XSLT. I've been imagining
that the tree generation goes on behind the scenes, so you don't get
much control over what the tree structure looks like. That means you'd
get nested rxp:match elements rather than bold and italic elements,
but at least you get something that you can then do more work with.

With what you have above, perhaps something like:

  <xsl:regexp-template match="\(\\[a-z]*\){\([^{}*]\)}">
    <xsl:element name="{current-match(1)}">
      <xsl:value-of select="current-match(2)" />
    </xsl:element>
  </xsl:regexp-template>

with:

  <xsl:apply-regexp-templates
    select="'\para{\italic{this} is \bold{bold \italic{and italic}} text.}'" />

The processor finds the regexp template that matches the longest
substring from the string you apply regexp templates to.
The substring is replaced by whatever you put in the content of the
regexp template to create a sequence (of strings and nodes). The
processor then tries to match on the concatenation of the string
values of the items in this new sequence. It finds the template that
matches the longest substring that is either a substring of a text
node descendant of an element in the original sequence or wholly
contains an element. (In other words, no overlap with element
content.) Then you need to have another function that returns the
matched string as a sequence of elements and text...

... but trying to work through this, I think this approach is doomed.
Trying to deal with both the concatenated string value and the
elements at the same time is just not worth the aggravation.

> plan 2':
> I suspect that one way to attack this in xslt2 is just to have two
> simple regexp replaces
>
> \\\([a-z]*\){ -> <start name="\1"/>
>
> } -> <end/>
>
> so after doing the regexp matching I'd have:
>
> <start name="para"/><start name="italic"/>this<end/> is <start
> name="bold"/>bold <start name="italic"/>and italic<end/><end/>
> text.<end/>
>
> so now we've got rid of that flat string and replaced it by
> something that's still flat but is mixed content with empty element
> nodes and text.
>
> Getting from that flat mixed content to a hierarchical element tree
> is just the famous xslt grouping problem which a typical Gumbie Cat
> ought to be able to do in her sleep, especially if given the xslt2
> grouping constructs.

Yes, good idea. I think that for this you just need a global replace
on the entire string. The trouble, of course, is specifying it.
You could do:

  <xsl:apply-regexp-templates
    select="'\para{\italic{this} is \bold{bold \italic{and italic}} text.}'" />

and have:

  <xsl:regexp-template match="(\\([a-z]*))\{">
    <xsl:element name="{current-match(2)}" />
  </xsl:regexp-template>

  <xsl:regexp-template match="\}">
    <end />
  </xsl:regexp-template>

And the xsl:apply-regexp-templates returns a sequence consisting of
whatever you get from the original string, with any matches
substituted in. I'd say that any substring that got matched should be
completely handled by the template that matches it - you use
xsl:apply-regexp-templates to do further processing on a submatch in
the string.

You could characterise it with the algorithm:

  - take the match string
  - locate the regexp template that matches the longest substring
    within it
  - split the string into three:
    a. the string before the matched substring
    b. the matched substring
    c. the string after the matched substring
  - the result is a sequence containing the result of continuing to
    apply regexp templates to (a), followed by the result of the
    template that matches (b), followed by the result of continuing
    to apply regexp templates to (c).

You should have modes on the regexp-templates, but I don't think
there's any need for priorities (aside from the old import precedence
thing - nothing from an imported stylesheet should take precedence
over the importing stylesheet).

The only trouble is that this doesn't enable matches of regular
expressions that are generated on the fly (or at least not from local
variables). I think you need an instruction - xsl:regexp-for-each,
say - that does something pretty simple: it iterates over alternate
matched and unmatched substrings within a string. Odd strings are
those that were unmatched, even strings are those that were matched.
Again the current-match() function returns information about the
subexpression matches, for the matched strings.
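To make that iteration concrete, here is a minimal sketch in Python
(not XSLT, since xsl:regexp-for-each is only a proposal at this
point). The names regexp_for_each, PATTERN and to_flat are invented
for the illustration. Emitting empty unmatched strings so that odd
positions are always unmatched is an assumption; the description above
doesn't settle that point.

```python
import re

# A "\name{" opener or a closing "}". Group 3 is the bare name.
PATTERN = r"((\\([a-z]*))\{)|(\})"


def regexp_for_each(text, pattern):
    """Yield (position, substring, match) for alternating substrings.

    Odd positions carry unmatched substrings (match is None), even
    positions carry matched ones. Unmatched substrings may be empty
    (an assumption), which keeps the odd/even parity stable.
    """
    position = 1
    last_end = 0
    for m in re.finditer(pattern, text):
        yield position, text[last_end:m.start()], None
        yield position + 1, m.group(0), m
        position += 2
        last_end = m.end()
    yield position, text[last_end:], None


def to_flat(text):
    """Build a flat list of text, ("start", name) and ("end",) items."""
    result = []
    for position, substring, m in regexp_for_each(text, PATTERN):
        if position % 2 == 1:
            if substring:  # drop empty unmatched strings from the output
                result.append(substring)
        elif substring == "}":
            result.append(("end",))
        else:
            result.append(("start", m.group(3)))
    return result


if __name__ == "__main__":
    sample = r"\para{\italic{this} is \bold{bold \italic{and italic}} text.}"
    for item in to_flat(sample):
        print(item)
```

The match object plays the role of current-match() here: it is only
available for the even (matched) items.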
So the equivalent of the above templates would be something like:

  <xsl:regexp-for-each
    select="'\para{\italic{this} is \bold{bold \italic{and italic}} text.}'"
    regexp="((\\([a-z]*))\{{)|(\}})">
    <xsl:choose>
      <xsl:when test="position() mod 2 = 1">
        <xsl:value-of select="." />
      </xsl:when>
      <xsl:when test=". = '}'">
        <end />
      </xsl:when>
      <xsl:otherwise>
        <xsl:element name="{current-match(3)}" />
      </xsl:otherwise>
    </xsl:choose>
  </xsl:regexp-for-each>

The regexp attribute would be an attribute value template (although I
have to say that mixing attribute value templates with regular
expressions is pretty messy because {}s are used quite a lot within
them, so perhaps an expression would be cleaner).

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
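
The flat sequence of start and end markers still needs the grouping
step David mentions before it becomes a tree. A minimal sketch of that
step in Python, using a stack, might look like the following. The
tuple representation of markers and elements, and the names nest and
FLAT, are assumptions made for the illustration rather than anything
proposed in the thread.

```python
# Flat mixed content as in David's plan 2': text interleaved with
# ("start", name) and ("end",) markers.
FLAT = [
    ("start", "para"), ("start", "italic"), "this", ("end",), " is ",
    ("start", "bold"), "bold ", ("start", "italic"), "and italic",
    ("end",), ("end",), " text.", ("end",),
]


def nest(flat):
    """Nest a flat marker sequence into (name, children) tuples."""
    root = []
    stack = [root]  # the top of the stack is the list being filled
    for item in flat:
        if item == ("end",):
            if len(stack) == 1:
                raise ValueError("end marker without a matching start")
            stack.pop()
        elif isinstance(item, tuple):
            # A ("start", name) marker opens a new element.
            children = []
            stack[-1].append((item[1], children))
            stack.append(children)
        else:
            # Plain text goes into the currently open element.
            stack[-1].append(item)
    if len(stack) != 1:
        raise ValueError("start marker without a matching end")
    return root


if __name__ == "__main__":
    print(nest(FLAT))
```

Unbalanced input raises an error here. A real stylesheet would have to
decide for itself how to treat a stray brace.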