A better xsl:analyze-string


Subject: A better xsl:analyze-string
From: Pavel Minaev <int19h@xxxxxxxxx>
Date: Thu, 20 Aug 2009 10:51:12 -0700

After some recent struggles with xsl:analyze-string, I would like to
share my thoughts on its current design, and how it could be improved
for specific scenarios.

On the surface, the construct seems to be very well suited for
tokenizing plain text input - indeed, judging from its semantics of
repeatedly applying a regex to the input string, this seems
deliberate. However, it is very inconvenient to figure out _what_
actually matched once it does match. One either has to match the
current substring one more time against the regex for each token in
turn, or make each token a separate group in
xsl:analyze-string/@regex, and see which of the groups is non-empty.
Say I want to tokenize into numbers, identifiers, and the rest,
ignoring whitespace. I would have to do something like this:

        <xsl:analyze-string select="'abc 123 foo 456'"
                            regex="(\s+)|(\d+(\.\d*)?)|(\w+)">
            <xsl:matching-substring>
                <xsl:choose>
                    <xsl:when test="regex-group(2) ne ''">
                        <xsl:text> number </xsl:text>
                        <xsl:value-of select="."/>
                    </xsl:when>
                    <xsl:when test="regex-group(4) ne ''">
                        <xsl:text> identifier </xsl:text>
                        <xsl:value-of select="."/>
                    </xsl:when>
                </xsl:choose>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:text> unknown </xsl:text>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>

This can get unwieldy really fast, because top-level regex groups for
tokens will often contain subgroups - even in the simple example above
this is already the case - and thus the indices of token groups are
not sequential; and, of course, there are no named groups in XSLT
regular expressions (which is something that might also come in
handy).
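
For comparison, here is a rough sketch of the other workaround mentioned
above: drop the token groups and re-match the current substring against
each token regex in turn with matches(). The stylesheet wrapper and the
named template "main" are only scaffolding assumed for illustration. Note
that the order of the xsl:when branches matters, since \w+ also matches
digits, and that every token is effectively matched twice.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text"/>

    <!-- "main" is an assumed entry point (e.g. an initial template) -->
    <xsl:template name="main">
        <xsl:analyze-string select="'abc 123 foo 456'"
                            regex="\s+|\d+(\.\d*)?|\w+">
            <xsl:matching-substring>
                <xsl:choose>
                    <!-- whitespace: matched, but produces no output -->
                    <xsl:when test="matches(., '^\s+$')"/>
                    <!-- must come before the \w+ test, which also accepts digits -->
                    <xsl:when test="matches(., '^\d+(\.\d*)?$')">
                        <xsl:text> number </xsl:text>
                        <xsl:value-of select="."/>
                    </xsl:when>
                    <xsl:when test="matches(., '^\w+$')">
                        <xsl:text> identifier </xsl:text>
                        <xsl:value-of select="."/>
                    </xsl:when>
                </xsl:choose>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:text> unknown </xsl:text>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

</xsl:stylesheet>
```

This avoids counting groups, but the token regexes now appear in two
places and have to be kept in sync by hand.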

I was wondering - for a case like this (which, I would imagine, is
pretty common when parsing non-trivial non-XML data) it would've been
more convenient to let the instruction itself do the branching on
tokens. Syntactically, it could look like this:

        <xsl:analyze-string select="'abc 123 foo 456'">
            <xsl:matching-substring regex="\s+"/>
            <xsl:matching-substring regex="\d+(\.\d*)?">
                <xsl:text> number </xsl:text>
                <xsl:value-of select="."/>
            </xsl:matching-substring>
            <xsl:matching-substring regex="\w+">
                <xsl:text> identifier </xsl:text>
                <xsl:value-of select="."/>
            </xsl:matching-substring>
            ...
            <xsl:non-matching-substring>
                <xsl:text> unknown </xsl:text>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>

That is, an alternate form of xsl:analyze-string which doesn't have
@regex, but which contains one or more xsl:matching-substring
instructions that all have @regex on them. For every matched
substring, the xsl:matching-substring instruction whose regex
matched is used. Otherwise, the semantics are the same (context
item/position/size, prohibition on regexes that can match empty
strings, etc).

It has a fairly obvious direct translation to the existing syntax for
xsl:analyze-string, so this really is just syntactic sugar, and thus
would be easy to implement - in fact, it could be done entirely by an
XSLT transform. At the same time, I believe that it makes a fairly
important use case so much easier.
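
To make the translation a bit more concrete, here is a minimal sketch of
the bookkeeping it needs: wrap each token regex in its own group, join
them with "|", and compute the index of each wrapper group by counting
the capturing groups in the regexes before it. The my: namespace, the
functions my:group-count and my:token-group, the $tokens and $names
variables and the "main" template are all hypothetical names made up for
this illustration. The group counter is deliberately simplistic: it
skips escaped characters and simple [...] classes, does not handle
character class subtraction, and back-references inside a token regex
would need renumbering, which is not attempted here.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:tokens"
    exclude-result-prefixes="xs my">

    <xsl:output method="text"/>

    <!-- Hypothetical helper: number of capturing groups in a regex.
         Removes escapes (\( etc.) and simple [...] classes first,
         then counts the remaining "(" characters. -->
    <xsl:function name="my:group-count" as="xs:integer">
        <xsl:param name="regex" as="xs:string"/>
        <xsl:variable name="bare"
            select="replace(replace($regex, '\\.', ''), '\[[^\]]*\]', '')"/>
        <xsl:sequence
            select="string-length($bare) - string-length(translate($bare, '(', ''))"/>
    </xsl:function>

    <!-- Hypothetical helper: index of the wrapper group of the $i-th token,
         i.e. $i plus the number of subgroups in all earlier tokens. -->
    <xsl:function name="my:token-group" as="xs:integer">
        <xsl:param name="regexes" as="xs:string*"/>
        <xsl:param name="i" as="xs:integer"/>
        <xsl:sequence
            select="$i + sum(for $r in subsequence($regexes, 1, $i - 1)
                             return my:group-count($r))"/>
    </xsl:function>

    <!-- Token regexes in priority order; an empty name means "skip" -->
    <xsl:variable name="tokens" as="xs:string*"
        select="('\s+', '\d+(\.\d*)?', '\w+')"/>
    <xsl:variable name="names" as="xs:string*"
        select="('', 'number', 'identifier')"/>

    <xsl:template name="main">
        <!-- @regex is an attribute value template, so it can be computed -->
        <xsl:analyze-string select="'abc 123 foo 456'"
            regex="{string-join(for $t in $tokens return concat('(', $t, ')'), '|')}">
            <xsl:matching-substring>
                <!-- position of the first token whose wrapper group is non-empty -->
                <xsl:variable name="hit" as="xs:integer?"
                    select="(for $i in 1 to count($tokens)
                             return if (regex-group(my:token-group($tokens, $i)) ne '')
                                    then $i else ())[1]"/>
                <xsl:if test="$names[$hit] ne ''">
                    <xsl:text> </xsl:text>
                    <xsl:value-of select="$names[$hit]"/>
                    <xsl:text> </xsl:text>
                    <xsl:value-of select="."/>
                </xsl:if>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:text> unknown </xsl:text>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

</xsl:stylesheet>
```

For the three token regexes above this yields wrapper groups 1, 2 and 4,
the same indices as in the hand-written version earlier in this message.
A real implementation of the proposed syntax would of course dispatch to
separate xsl:matching-substring bodies rather than to a list of names.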

Your thoughts?

