Subject: RE: [XSLT2.0] xsl:analyze-string@regex syntax too limited
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Wed, 15 Dec 2004 19:48:21 -0000
> Hi, just FYI, I have made a petition to the XSLT and XPath 2.0
> public comments list to remove most of the artificial restrictions
> on the regex syntax in the match, replace functions and 
> analyze-string instructions. 

It doesn't seem to be there yet...

Please note there's no need to comment separately on the two documents. XSLT
will automatically pick up any changes made to the XPath functions.
> 
> Michael Kay had to add a pretty complex piece of code to his 
> Saxon processor just to cripple the available regex syntax which
> was previously supported. That's ridiculous.
> 

It's very unlikely that XPath will support the whole of the Java regex
syntax; for example, the POSIX character classes won't get past the I18N
scrutineers. Also, Java regexes match 16-bit UTF-16 code units, not Unicode
characters, so a character outside the BMP counts as two characters in a
Java regex but as one character in an XPath regex. A lot of the regex
translation code in Saxon is designed to handle such differences, not to
remove functionality, so any changes to the XPath syntax won't remove the
need for the regex translator. (The translator, incidentally, was written by
James Clark to implement the XML Schema regex syntax, and I extended it to
handle the XPath extensions.)
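
The code-unit versus character distinction can be seen outside any regex
engine. The following is a minimal Python sketch, not Saxon's translator; the
helper name `utf16_code_units` and the sample character U+1D538 are chosen
purely for illustration.

```python
def utf16_code_units(text):
    """Return the number of 16-bit UTF-16 code units needed to encode text."""
    # Each UTF-16 code unit occupies two bytes; "utf-16-le" adds no BOM.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "A"               # U+0041, inside the BMP
astral_char = "\U0001D538"   # U+1D538, outside the BMP (illustrative choice)

# Python strings are sequences of code points, so len() counts characters,
# while utf16_code_units() counts what a code-unit-based engine would see.
print(len(bmp_char), utf16_code_units(bmp_char))        # 1 1
print(len(astral_char), utf16_code_units(astral_char))  # 1 2
```

An engine that counts code units sees two items where a character-based
engine sees one, which is the kind of gap a translator between the two
syntaxes has to bridge.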

As I've commented elsewhere, one of the main difficulties in "adding back"
further Perl regex features is the need to write an unambiguous
specification that is consistent with existing implementations. Writing a
spec that turns out to be inconsistent with existing implementations would
obviously be a disaster. This always turns out to be more difficult than you
think. To take just one example that you want to add, in Perl:

"A word boundary (`\b') is a spot between two characters
that has a `\w' on one side of it and a `\W' on the other
side of it (in either order), counting the imaginary
characters off the beginning and end of the string as matching
a `\W'. (Within character classes `\b' represents
backspace rather than a word boundary, just as it normally
does in any double-quoted string.)"

Firstly, that's too informal for the WGs to accept it as written (what is a
"spot"? what is an "imaginary character"?). Secondly, Perl classifies \b as a
"zero-width assertion" but it doesn't say clearly where in the overall
scheme of things a zero-width assertion can appear. Thirdly, the exception
for character classes doesn't apply, because backspace isn't a legal XML
character. So getting an agreed spec just for \b could easily take an hour
of WG time, and the WG is getting pretty impatient about proposals that
consume time unless there is a problem that absolutely must be solved.
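
One way to pin down "spot" and "imaginary character" is to treat the
positions 0..n of an n-character string as the spots, and to treat anything
beyond either end as a non-word character. The following Python sketch is one
reading of the Perl text, not wording from any spec; `is_word`,
`word_boundaries` and `regex_boundaries` are hypothetical helper names, and
Python's own `\w` is assumed as the word-character test.

```python
import re


def is_word(ch):
    # An empty string stands for the "imaginary character" off either end
    # of the input, which is treated as a non-word character.
    return bool(ch) and re.fullmatch(r"\w", ch) is not None


def word_boundaries(text):
    # A position is a boundary when exactly one of its two neighbours
    # is a word character.
    positions = []
    for i in range(len(text) + 1):
        before = text[i - 1] if i > 0 else ""
        after = text[i] if i < len(text) else ""
        if is_word(before) != is_word(after):
            positions.append(i)
    return positions


def regex_boundaries(text):
    # The same positions as reported by Python's zero-width \b assertion.
    return [m.start() for m in re.finditer(r"\b", text)]


print(word_boundaries("ab cd"))   # [0, 2, 3, 5]
print(regex_boundaries("ab cd"))  # [0, 2, 3, 5]

# Python mirrors the Perl exception: inside a character class, \b is backspace.
print(re.fullmatch(r"[\b]", "\x08") is not None)  # True
```

Even this small formalisation has to take a position on what \w means and on
how the ends of the string behave, which is where the specification effort
goes.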

Just warning you...

Michael Kay
http://www.saxonica.com/
