Re: [xsl] check the type of the $pattern argument to a regul

Cart

XML Editor - Download a Free Trial >

See What's New >

Buy Now >

[Home] [By Thread] [By Date] [Recent Entries]

Subject: Re: check the type of the $pattern argument to a regular expression?
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Tue, 17 Apr 2007 01:36:08 +0200

Dimitre Novatchev wrote:

That it can't be
done using a regular expression is one thing, that it can't be done
with XSL-T 2.0 is incorrect, given that all one would need to check
the validity of regular expressions is a turing complete language.

In fact,  have a general purpose LR(1) parser written cmpletely in
XSLT 2.0, which will parse any input and return the sequence of
poduction used in the parse.

Provide any LR-1 grammar written in BNF and you'll get the checker.

I am unfamiliar with the term LR-1, and likely this is not what it should be, it is just a poor man's attempt to solve the OP's original inquiry: checking the validity of a regular expression. Just by glimpsing over my own solution, I see plenty of parts for improvement. But I didn't want to make it a week-long assignment, so consider it a start for testing the most common mistakes in your (Bryan's) regular expressions.

As a guide, I used the W3 XML Schema specification and the XSLT 2.0 / XPath extensions. The solution the is laid out below supports the following:

* Correct use of escapes * Correct use of parentheses and escaped parenteses * Escaped meta characters and correct pairs of unescaped ones (i.e., {..} and [...]) * Fully supports character classes, except for the unicode categories * Supports backreferences, but does not check for validity of them * Supports 'matches', but not 'replace/tokenize' (i.e., empty-matching is not considered wrong) * Supports all sorts of greedy/reluctant quantifiers and errors if specified incorrectly

There's more to it, but I leave that as an exercise to the reader. In a nutshell, if you want to apply it, be aware that it checks the validity of the _syntax_, not the overall validity of the regular expression. That means that ' [z-a] ' validates correct, and ' (.)\2 ' will validate correct as well, but neither will compile.

Run the stylesheet against anything, or with IT "main", and it will give you the following example output:

This valid regex ' (a(\(b(c+?.d)))+ ' is asserted as:  valid
This valid regex ' ((((((.).).)+?))+) ' is asserted as:  valid
This valid regex ' a{1,12}b\{c?? ' is asserted as:  valid
This valid regex ' ([^a-z123\n]+?)\1 ' is asserted as:  valid
This invalid regex ' (a(b(c)) ' is asserted as:  invalid
This invalid regex ' [abe\l] ' is asserted as:  invalid
This invalid regex ' (a|b|c|d)\\\ ' is asserted as:  invalid
This invalid regex ' .++ ' is asserted as:  invalid

All in all, the build-up of the regex to check the regex is quite straight-forward. The only trouble is in the recursive part (which could do with some better writing). Call it as re:is-regex(....).

Have fun with it. If you plan to add the category escapes to the list of supported features, let me know, please ;)

Cheers,
-- Abel Braaksma

PS: it _can_ be done in regular expressions alone, but then you would loose some of the lexical analysis (paired grouping parentheses, mainly).

<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet xmlns:xs = "http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:re = "http://regex" version="2.0"> <xsl:output method="text" /> <xsl:variable name="test-re"> <re assert="valid">(a(\(b(c+?.d)))+</re> <re assert="valid">((((((.).).)+?))+)</re> <re assert="valid">a{1,12}b\{c??</re> <re assert="valid">([^a-z123\n]+?)\1</re> <re assert="invalid">(a(b(c))</re> <re assert="invalid">[abe\l]</re> <re assert="invalid">(a|b|c|d)\\\</re> <re assert="invalid">.++</re> </xsl:variable> <xsl:template match="/" name="main"> <xsl:apply-templates select="$test-re/*" /> </xsl:template> <xsl:template match="re"> <xsl:value-of select=" 'This', @assert, 'regex ''', ., ''' is asserted as: ', ('invalid', 'valid')[1 + xs:integer(re:is-regex(current()))], '
'" /> </xsl:template>  <xsl:variable name="quantifier" as="xs:string"> ( [?*+]  |  \{[0-9]+,?[0-9]*\}  )\??  </xsl:variable>  <xsl:function name="re:test-partial-regex" as="xs:boolean"> <xsl:param name="re" as="xs:string" />  <xsl:variable name="char"> [^.\\?*+{}()|^$\[\]] </xsl:variable> <xsl:variable name="single-char-escape"> \\[nrt\\|.?*+(){}^\[\]$\-] </xsl:variable> <xsl:variable name="multi-char-escape"> (\\[sSiIcCdDwW] | \. | \\[0-9]+) </xsl:variable> <xsl:variable name="char-class-escape"> <xsl:text>(</xsl:text> <xsl:sequence select="$single-char-escape" /> <xsl:text>|</xsl:text> <xsl:sequence select="$multi-char-escape" /> <xsl:text>)</xsl:text> </xsl:variable>  <xsl:variable name="meta-char"> [nrt\\|.?*+(){}^\[\]$\sSiIcCdDwW-] </xsl:variable>  <xsl:variable name="xml-char"> [^\\\[\]\-] </xsl:variable>  <xsl:variable name="char-or-esc"> <xsl:text>(</xsl:text> <xsl:sequence select="$xml-char" /> <xsl:text>|</xsl:text> <xsl:sequence select="$single-char-escape" /> <xsl:text>)</xsl:text> </xsl:variable>  <xsl:variable name="char-range"> <xsl:text></xsl:text> <xsl:sequence select="$char-or-esc" /> <xsl:text>-</xsl:text> <xsl:sequence select="$char-or-esc" /> </xsl:variable>   <xsl:variable name="char-class-expr"> \[ (\^|-)? <xsl:text>(</xsl:text> <xsl:sequence select="$char-or-esc" /> <xsl:text>|</xsl:text> <xsl:sequence select="$char-range" /> <xsl:text>)+</xsl:text> \] </xsl:variable>  <xsl:variable name="char-class"> <xsl:text>(</xsl:text> <xsl:sequence select="$char-class-escape" /> <xsl:text>|</xsl:text> <xsl:sequence select="$char-class-expr" /> <xsl:text>)</xsl:text> </xsl:variable>  <xsl:variable name="atom" as="xs:string+"> <xsl:text>(</xsl:text> <xsl:sequence select="$char" /> <xsl:text>|</xsl:text> <xsl:sequence select="$char-class" /> <xsl:text>)</xsl:text> </xsl:variable>  <xsl:variable name="piece"> <xsl:text>(</xsl:text> <xsl:sequence select="$atom" /> <xsl:text>)</xsl:text> <xsl:text>(</xsl:text> <xsl:sequence select="$quantifier" /> <xsl:text>)?</xsl:text> </xsl:variable>  <xsl:variable name="branch"> <xsl:text>(</xsl:text> <xsl:sequence select="$piece" /> <xsl:text>)+</xsl:text> </xsl:variable>

 <xsl:variable name="regexp" > <xsl:text>^$|^</xsl:text> <xsl:sequence select="$branch" /> <xsl:text>(\|</xsl:text> <xsl:sequence select="$branch" /> <xsl:text>)*</xsl:text> <xsl:text>$</xsl:text> </xsl:variable> <xsl:sequence select="matches($re, $regexp, 'x')" /> </xsl:function> <xsl:function name="re:is-regex" as="xs:boolean"> <xsl:param name="re" as="xs:string" />  <xsl:variable name="result-seq" as="xs:anyAtomicType*">  <xsl:analyze-string select="replace($re, '\\$|\\$', '')" regex="$([^()]*)$({$quantifier})?" flags="x"> <xsl:matching-substring>  <xsl:sequence select="re:test-partial-regex(regex-group(1))" /> </xsl:matching-substring> <xsl:non-matching-substring> <xsl:sequence select="xs:string(.)" /> </xsl:non-matching-substring> </xsl:analyze-string> </xsl:variable> <xsl:sequence select=" (: if a non-regex partial match is found, exit early :) if(some $b in $result-seq satisfies $b instance of xs:boolean and xs:boolean($b) = false()) then false() else (: last string, no parentheses anymore :) if(count($result-seq) = 1 and $result-seq[1] instance of xs:string) then re:test-partial-regex($result-seq[1]) else (: more sub regexes available still, recursively apply :) if($result-seq[. instance of xs:string][1]) then re:is-regex(string-join($result-seq[not(. instance of xs:boolean)], '')) else every $b in $result-seq satisfies xs:boolean($b) = true()" /> </xsl:function> </xsl:stylesheet>

Current Thread
Re: check the type of the $pattern argument to a regular expression?, (continued) bryan rasmussen - 16 Apr 2007 14:02:20 -0000 Abel Braaksma - 16 Apr 2007 14:16:51 -0000 Michael Kay - 16 Apr 2007 22:50:09 -0000 Dimitre Novatchev - 16 Apr 2007 23:13:36 -0000 Abel Braaksma - 16 Apr 2007 23:38:42 -0000 <= Michael Kay - 16 Apr 2007 11:57:25 -0000

<- Previous	Index	Next ->
Re: check the type of the $pa, Dimitre Novatchev	Thread	RE: check the type of the $pa, Michael Kay
RE: XSLT Dead?, Michael Kay	Date	Re: XSLT Dead?, Robert Koberg
	Month

XML Editor - Download a 15 Day Free Trial Now >

See What's New in Stylus Studio >

Buy Stylus Studio - XML Editor - Now >