Subject: RE: [XSLT2.0] xsl:analyze-string@regex syntax too limited
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Wed, 15 Dec 2004 19:48:21 -0000

> Hi, just FYI, I have made a petition to the XSLT and XPath 2.0
> public comments list to remove most of the artificial restrictions
> on the regex syntax in the match, replace functions and 
> analyze-string instructions. 

It doesn't seem to be there yet...

Please note there's no need to comment separately on the two documents. XSLT
will automatically pick up any changes made to the XPath functions.
> 
> Michael Kay had to add a pretty complex piece of code to his 
> Saxon processor just to cripple the available regex syntax which
> was previously supported. That's ridiculous.
> 

It's very unlikely that XPath will support the whole of the Java regex
syntax; for example, the POSIX character classes won't get past the I18N
scrutineers. Also, Java regexes match 16-bit UTF-16 values, not Unicode
characters: a character outside the BMP counts as two
characters in a Java regex but as one character in an XPath regex. A lot of
the regex translation code in Saxon is designed to handle such differences,
not to remove functionality. So any changes to the XPath syntax won't remove
the need for the regex translator. (The translator, incidentally, was
written by James Clark to implement the XML Schema regex syntax, and I
extended it to handle the XPath extensions.)
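
The code-unit versus character distinction can be sketched in a few lines of Python. This is only an illustration, not Saxon's translator code: the helper name utf16_units and the sample character U+1D50A are assumptions chosen for the example. Python strings are sequences of Unicode code points, so they behave like the XPath view, while encoding to UTF-16 exposes the 16-bit values described above.

```python
import re

# Illustration only: U+1D50A is an arbitrary character outside the BMP.
ch = "\U0001D50A"


def utf16_units(s: str) -> int:
    """Return the number of 16-bit UTF-16 code units needed to encode s."""
    return len(s.encode("utf-16-le")) // 2


code_points = len(ch)         # Python counts code points: the XPath view
code_units = utf16_units(ch)  # a surrogate pair: the 16-bit view

# A code-point-based engine sees one character, so "." matches the whole string.
dot_matches_whole = re.fullmatch(r".", ch) is not None

print(code_points, code_units, dot_matches_whole)
```

An engine that works on 16-bit values sees two units here, so a single-character wildcard, a quantifier such as {1}, or a character-class range involving non-BMP characters has to be rewritten before being handed to it.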

As I've commented elsewhere, one of the main difficulties in "adding back"
further Perl regex features is the need to write an unambiguous
specification that is consistent with existing implementations. Writing a
spec that turns out to be inconsistent with existing implementations would
obviously be a disaster. This always turns out to be more difficult than you
think. To take just one example that you want to add, in Perl:

"A word boundary (`\b') is a spot between two characters
that has a `\w' on one side of it and a `\W' on the other
side of it (in either order), counting the imaginary characters
off the beginning and end of the string as matching
a `\W'. (Within character classes `\b' represents
backspace rather than a word boundary, just as it normally
does in any double-quoted string.)"

Firstly, that's too informal for the WGs to accept it as written (what is a
"spot"? what is an "imaginary character"?). Secondly, Perl classifies \b as a
"zero-width assertion" but it doesn't say clearly where in the overall
scheme of things a zero-width assertion can appear. Thirdly, the exception
doesn't apply, because backspace isn't a legal XML character. So getting an
agreed spec just for \b could easily take an hour of WG time, and the WG is
getting pretty impatient about proposals that consume time unless there is a
problem that absolutely must be solved.
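
The behaviours that such a spec would have to pin down can be seen in a short Python sketch. Python's re module is used here only as a convenient Perl-style engine; it is an assumption of this example, not the Perl or XPath definition, and the sample strings are made up.

```python
import re

text = "cat concat cat."

# Outside a character class, \b is a word boundary: "cat" inside "concat"
# is not matched because it is preceded by a word character.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# The assertion is zero-width: it matches a position, not a character.
first_boundary = re.search(r"\b", "cat").span()

# The ends of the string act as if a non-word character sat beyond them.
at_edges = re.fullmatch(r"\bcat\b", "cat") is not None

# Inside a character class, \b means backspace (U+0008) instead.
backspace = re.fullmatch(r"[\b]", "\x08") is not None

print(whole_words, first_boundary, at_edges, backspace)
```

The last case is the exception that cannot arise in XPath, since U+0008 is not a legal XML 1.0 character.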

Just warning you...

Michael Kay
http://www.saxonica.com/
