
Re: lookaheads in XSLT2 regexes

Subject: Re: lookaheads in XSLT2 regexes
From: "Imsieke, Gerrit, le-tex" <gerrit.imsieke@xxxxxxxxx>
Date: Tue, 02 Mar 2010 00:10:49 +0100
The current situation isn't satisfactory. Expressing a seemingly simple regex such as '\bstring\b' as an XSD-compliant regex is tedious for at least two reasons:
- You cannot resort to lookahead and lookbehind assertions for the surrounding whitespace. Instead you have to include the unwanted characters (or character classes) explicitly in your regex and filter them out of the results afterwards (by grouping and then referring to the group that holds the wanted string).
- You have to include the string/line start/end anchors ^ and $ explicitly in your regex, making the expression even more complex and verbose. (Note that the Perlish \b doesn't only match where \w and \W meet, but also where \w is at the beginning or end of a string or line.)
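
To make the two points above concrete, here is a minimal sketch of the lookaround-free workaround. It is written in Python rather than XSLT only so that it runs stand-alone; Python's \W and \b are not identical to the XSD or Perl ones, and the names EMULATED, NATIVE, emulated_hits and native_hits are made up for this illustration.

```python
import re

# Emulation without \b or lookaround: the neighbouring non-word character
# (or the string start/end) is matched explicitly, and the wanted word is
# recovered from group 2.
EMULATED = re.compile(r"(^|\W)(string)(\W|$)")

# What one would like to write instead (supported by Python's engine).
NATIVE = re.compile(r"\bstring\b")


def emulated_hits(text):
    """Start offsets of 'string' found via the grouping workaround."""
    return [m.start(2) for m in EMULATED.finditer(text)]


def native_hits(text):
    """Start offsets of 'string' found via real word-boundary assertions."""
    return [m.start() for m in NATIVE.finditer(text)]


if __name__ == "__main__":
    for sample in ("a string, substring; string", "string string"):
        print(repr(sample), emulated_hits(sample), native_hits(sample))
```

On "string string" the workaround finds only the first occurrence, because the separating space is consumed by the first match and is no longer available as the left-hand neighbour of the second. That is the sort of extra bookkeeping the emulation forces on the stylesheet author.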


I think \w is already defined in a pragmatic, not-so-arbitrary and (as far as I can see) locale-independent way: in http://www.w3.org/TR/xmlschema-2/ (and hence in the XPath regex functions), it is [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}].
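
As a rough illustration of that definition, the following Python sketch classifies a character by its Unicode general category. The function name is_xsd_word_char is hypothetical, and the result depends on the Unicode tables shipped with the Python version in use, so treat it as an approximation of the XSD class rather than a reference implementation.

```python
import unicodedata


def is_xsd_word_char(ch):
    """True if ch is outside the P (punctuation), Z (separator) and
    C (other) categories, i.e. inside the XSD \\w class as quoted above."""
    # The first letter of the two-letter general category is the major class.
    return unicodedata.category(ch)[0] not in "PZC"


if __name__ == "__main__":
    for ch in ("a", "1", "_", "+", " ", "\u2011"):
        print(repr(ch), unicodedata.category(ch), is_xsd_word_char(ch))
```

Two consequences are worth noticing: the underscore (category Pc) is not a word character under this definition, whereas symbols such as + (category Sm) are.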

In addition to locations where \w and \W characters meet, \b should match where a \w character sits at the start or at the end of the respective string/line.
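
A small sketch of that rule, assuming the XSD-style \w from above and treating the positions before the first and after the last character as if a non-word character were there. It is Python for illustration only; word_boundaries and is_xsd_word_char are hypothetical names, and line-wise matching is not handled.

```python
import unicodedata


def is_xsd_word_char(ch):
    # Approximation of XSD \w: everything except categories P, Z and C.
    return unicodedata.category(ch)[0] not in "PZC"


def word_boundaries(text, is_word=is_xsd_word_char):
    """Offsets at which a zero-width word-boundary assertion would hold."""
    positions = []
    for i in range(len(text) + 1):
        # Outside the string counts as "not a word character".
        before = i > 0 and is_word(text[i - 1])
        after = i < len(text) and is_word(text[i])
        if before != after:
            positions.append(i)
    return positions


if __name__ == "__main__":
    print(word_boundaries("ab, cd"))
```

With this convention no separate ^ and $ alternatives are needed: offsets 0 and len(text) are reported whenever the string begins or ends with a word character.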

Of course parameterized \w or \b tokens that accept any set of character classes or character ranges might be useful in certain situations. But this flexibility shouldn't have as its downside that you have to repeatedly use verbose expressions, e.g., \b{[\p{Ll}\p{Lu}&#x2011;]}.

The "word constituent character list" should rather be globally definable for the current XSLT/XQuery document, maybe as a stylesheet attribute extension (until the feature is in the spec):
<xsl:stylesheet ... saxon:word-constituents="[\p{Ll}\p{Lu}&#x2011;]">.
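
To show what a globally defined constituent list would buy, here is a sketch in Python (for illustration only) in which the word class is configured once and then reused by a whole-word search. The helper names make_word_predicate and find_whole_word are invented for this example, and the saxon:word-constituents attribute above is a proposal, not an existing Saxon feature.

```python
import unicodedata


def make_word_predicate(categories, extra_chars=""):
    """Build an is_word(ch) test from Unicode general categories plus
    individually listed characters."""
    cats = frozenset(categories)
    extras = frozenset(extra_chars)

    def is_word(ch):
        return ch in extras or unicodedata.category(ch) in cats

    return is_word


def find_whole_word(text, word, is_word):
    """Start offsets of word where neither neighbour is a word character.
    Assumes word itself starts and ends with a word character."""
    hits = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        left_ok = start == 0 or not is_word(text[start - 1])
        right_ok = end == len(text) or not is_word(text[end])
        if left_ok and right_ok:
            hits.append(start)
        start = text.find(word, start + 1)
    return hits


# Counterpart of [\p{Ll}\p{Lu}&#x2011;]: letters plus the non-breaking hyphen.
IS_WORD = make_word_predicate({"Ll", "Lu"}, "\u2011")

if __name__ == "__main__":
    sample = "non\u2011string string"
    print(find_whole_word(sample, "string", IS_WORD))
    print(find_whole_word(sample, "string", make_word_predicate({"Ll", "Lu"})))
```

With the non-breaking hyphen declared a word constituent, "string" inside "non‑string" no longer counts as a separate word; without it, both occurrences are reported.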


But for at least 80% of the real-life cases, the default XSD \w definition may be used when implementing \b. Perhaps that's not the one that Java uses for \w and \b, so it might not come with Saxon for free.

Gerrit



On 01.03.2010 18:52, Michael Kay wrote:

I didn't realise we were missing \b -- we should add it, if that's the case.


I think it was omitted deliberately, on the grounds that it's locale-sensitive. It's defined in Perl as matching "a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order)", where \w matches a "word" character (defined as "alphanumeric" plus "_"), in which "the list of alphabetic characters generated by \w is taken from the current locale". That's not an acceptable definition for our purposes, so it's arguably better to have no definition at all.

We could perhaps define \w to match "alphanumeric" as the term is used in
xsl:number (categories Nd, Nl, No, Lu, Ll, Lt, Lm or Lo) and then it's a
well-defined concept, though not necessarily one that matches user
expectations.
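
For comparison, a short Python sketch (illustrative only, with hypothetical function names) that sets the xsl:number notion of "alphanumeric" listed above against the XSD \w class and reports the characters on which the two disagree. The outcome depends on the Unicode data of the Python version in use.

```python
import unicodedata

# Categories named above for "alphanumeric" in the xsl:number sense.
ALPHANUMERIC = {"Nd", "Nl", "No", "Lu", "Ll", "Lt", "Lm", "Lo"}


def is_alphanumeric(ch):
    return unicodedata.category(ch) in ALPHANUMERIC


def is_xsd_word_char(ch):
    # XSD \w: everything except categories P, Z and C.
    return unicodedata.category(ch)[0] not in "PZC"


def disagreements(chars):
    """Characters that one definition accepts and the other rejects."""
    return [c for c in chars if is_alphanumeric(c) != is_xsd_word_char(c)]


if __name__ == "__main__":
    # U+0301 is a combining acute accent (category Mn).
    print([hex(ord(c)) for c in disagreements("a1_+ \u0301")])
```

The two classes differ on symbols and combining marks, which the XSD definition includes and the alphanumeric one leaves out.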

The fact that Perl overloads \b to mean backspace when within a character
class doesn't help.

And one feels that if it's useful to have a metacharacter that matches the
spot between a character in one character class and a character in its
complement, then one ought to generalize the concept so it works with any
character class, not just the rather arbitrary class containing Nd, Nl, No,
Lu, Ll, Lt, Lm and Lo.

Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay


--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler
