Re: lookaheads in XSLT2 regexes
The current situation isn't satisfactory. Expressing a seemingly simple
regex such as '\bstring\b' in an XSD-compliant regex seems tedious for
at least two reasons:
- You are not allowed to resort to lookahead and lookbehind assertions for whitespace. Instead you have to expressly include the unwanted characters (or character classes) in your regex and sort them out of the results afterwards (by grouping and backreferencing the wanted string).
- You have to include the string/line start/end anchors ^ and $ explicitly in your regex, making the expression even more complex or verbose. (Note that the Perlish \b doesn't only match where \w and \W meet, but also when \w is at the beginning or end of a string or line.)

I think \w is already defined in a pragmatic, not-so-arbitrary and (as far as I can see) locale-independent way: In http://www.w3.org/TR/xmlschema-2/ (and hence in the XPath regex functions), it is [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]. In addition to locations where \w and \W characters meet, \b should match \w at the start or at the end of the respective string/line.

Of course, parameterized \w or \b tokens that accept any set of character classes or character ranges might be useful in certain situations. But this flexibility shouldn't come at the price of having to repeat verbose expressions, e.g., \b{[\p{Ll}\p{Lu}‑]}. The "word constituent character list" should rather be globally definable for the current XSLT/XQuery document, maybe as a stylesheet attribute extension (until the feature is in the spec): <xsl:stylesheet ... saxon:word-constituents="[\p{Ll}\p{Lu}‑]">.

But for at least 80% of the real-life cases, the default XSD \w definition may be used when implementing \b. Perhaps that's not the one that Java uses for \w and \b, so it might not come with Saxon for free.

Gerrit

On 01.03.2010 18:52, Michael Kay wrote:
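For illustration only, here is a rough Python sketch of the \b semantics proposed in this message, with \w taken to be the XSD definition [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]. It is not how Saxon or any XSLT processor works. The function names (is_xsd_word_char, is_word_boundary, find_whole_word) are made up for this sketch, and Unicode general categories are looked up with Python's standard unicodedata module, whose Unicode version may differ from the one a given XSD processor uses.

```python
import unicodedata


def is_xsd_word_char(ch: str) -> bool:
    # XSD \w is [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]: every character except
    # punctuation (P*), separators (Z*) and "other" (C*) characters.
    return unicodedata.category(ch)[0] not in "PZC"


def is_word_boundary(text: str, pos: int) -> bool:
    # pos is a position between characters, 0 <= pos <= len(text).
    before = pos > 0 and is_xsd_word_char(text[pos - 1])
    after = pos < len(text) and is_xsd_word_char(text[pos])
    # A boundary exists where exactly one side is a word character. The
    # start and the end of the string are treated as non-word sides.
    return before != after


def find_whole_word(text: str, word: str) -> list[int]:
    # Start offsets where `word` has a boundary on both sides, i.e. what
    # '\bword\b' would match if \b were based on the XSD \w definition.
    hits = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        if is_word_boundary(text, start) and is_word_boundary(text, end):
            hits.append(start)
        start = text.find(word, start + 1)
    return hits


print(find_whole_word("string substring string", "string"))  # [0, 17]
print(find_whole_word("(string)", "string"))                 # [1]
print(find_whole_word("x+string", "string"))                 # []
```

The last call shows one consequence of reusing the XSD definition: symbols such as "+" (category Sm) are word characters under XSD \w, so no boundary falls between "+" and "s", whereas a Perl-style \b would match there.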
Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930
Geschäftsführer: Gerrit Imsieke, Svea Jelonek, Thomas Schmidt, Dr. Reinhard Vöckler