[Home] [By Thread] [By Date] [Recent Entries]
Dimitre Novatchev wrote:
That it can't be done using a regular expression is one thing, that it can't be done with XSL-T 2.0 is incorrect, given that all one would need to check the validity of regular expressions is a turing complete language. I am unfamiliar with the term LR-1, and likely this is not what it should be, it is just a poor man's attempt to solve the OP's original inquiry: checking the validity of a regular expression. Just by glimpsing over my own solution, I see plenty of parts for improvement. But I didn't want to make it a week-long assignment, so consider it a start for testing the most common mistakes in your (Bryan's) regular expressions. As a guide, I used the W3 XML Schema specification and the XSLT 2.0 / XPath extensions. The solution the is laid out below supports the following: * Correct use of escapes * Correct use of parentheses and escaped parenteses * Escaped meta characters and correct pairs of unescaped ones (i.e., {..} and [...]) * Fully supports character classes, except for the unicode categories * Supports backreferences, but does not check for validity of them * Supports 'matches', but not 'replace/tokenize' (i.e., empty-matching is not considered wrong) * Supports all sorts of greedy/reluctant quantifiers and errors if specified incorrectly There's more to it, but I leave that as an exercise to the reader. In a nutshell, if you want to apply it, be aware that it checks the validity of the _syntax_, not the overall validity of the regular expression. That means that ' [z-a] ' validates correct, and ' (.)\2 ' will validate correct as well, but neither will compile. Run the stylesheet against anything, or with IT "main", and it will give you the following example output: This valid regex ' (a(\(b(c+?.d)))+ ' is asserted as: valid
This valid regex ' ((((((.).).)+?))+) ' is asserted as: valid
This valid regex ' a{1,12}b\{c?? ' is asserted as: valid
This valid regex ' ([^a-z123\n]+?)\1 ' is asserted as: valid
This invalid regex ' (a(b(c)) ' is asserted as: invalid
This invalid regex ' [abe\l] ' is asserted as: invalid
This invalid regex ' (a|b|c|d)\\\ ' is asserted as: invalid
This invalid regex ' .++ ' is asserted as: invalidAll in all, the build-up of the regex to check the regex is quite straight-forward. The only trouble is in the recursive part (which could do with some better writing). Call it as re:is-regex(....). Have fun with it. If you plan to add the category escapes to the list of supported features, let me know, please ;) Cheers, -- Abel Braaksma PS: it _can_ be done in regular expressions alone, but then you would loose some of the lexical analysis (paired grouping parentheses, mainly). <?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet xmlns:xs = "http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:re = "http://regex" version="2.0"> <xsl:output method="text" /> <xsl:variable name="test-re"> <re assert="valid">(a(\(b(c+?.d)))+</re> <re assert="valid">((((((.).).)+?))+)</re> <re assert="valid">a{1,12}b\{c??</re> <re assert="valid">([^a-z123\n]+?)\1</re> <re assert="invalid">(a(b(c))</re> <re assert="invalid">[abe\l]</re> <re assert="invalid">(a|b|c|d)\\\</re> <re assert="invalid">.++</re> </xsl:variable> <xsl:template match="/" name="main"> <xsl:apply-templates select="$test-re/*" /> </xsl:template> <xsl:template match="re"> <xsl:value-of select=" 'This', @assert, 'regex ''', ., ''' is asserted as: ', ('invalid', 'valid')[1 + xs:integer(re:is-regex(current()))], '
'" /> </xsl:template> <!-- zero-or-one, zero-or-more, one-or-more etc (global because we need it in two functions) --> <xsl:variable name="quantifier" as="xs:string"> ( [?*+] <!-- q. mark, asterisk, plus sign --> | <!-- or --> \{[0-9]+,?[0-9]*\} <!-- {.. , ..} quantifier --> )\?? <!-- optional reluctant quantifier --> </xsl:variable> <!-- Test a regex. It is first 'normalized' by removing any whitespace. This may inadvertently make a valid regex invalid if an escaped whitespace is part of the regex. --> <xsl:function name="re:test-partial-regex" as="xs:boolean"> <xsl:param name="re" as="xs:string" /> <!-- a single char, not a meta char --> <xsl:variable name="char"> [^.\\?*+{}()|^$\[\]] </xsl:variable> <xsl:variable name="single-char-escape"> \\[nrt\\|.?*+(){}^\[\]$\-] </xsl:variable> <xsl:variable name="multi-char-escape"> (\\[sSiIcCdDwW] | \. | \\[0-9]+) </xsl:variable> <xsl:variable name="char-class-escape"> <xsl:text>(</xsl:text> <xsl:sequence select="$single-char-escape" /> <xsl:text>|</xsl:text> <xsl:sequence select="$multi-char-escape" /> <xsl:text>)</xsl:text> </xsl:variable> <!-- a single meta char --> <xsl:variable name="meta-char"> [nrt\\|.?*+(){}^\[\]$\sSiIcCdDwW-] </xsl:variable> <!-- a char that is allowed inside a character class expression --> <xsl:variable name="xml-char"> [^\\\[\]\-] </xsl:variable> <!-- base 'char' of a character class expression --> <xsl:variable name="char-or-esc"> <xsl:text>(</xsl:text> <xsl:sequence select="$xml-char" /> <xsl:text>|</xsl:text> <xsl:sequence select="$single-char-escape" /> <xsl:text>)</xsl:text> </xsl:variable> <!-- range in a character class expression --> <xsl:variable name="char-range"> <xsl:text></xsl:text> <xsl:sequence select="$char-or-esc" /> <xsl:text>-</xsl:text> <xsl:sequence select="$char-or-esc" /> </xsl:variable> <!-- a character class expression --> <!-- all that can be between [ and ] --> <xsl:variable name="char-class-expr"> \[ (\^|-)? <xsl:text>(</xsl:text> <xsl:sequence select="$char-or-esc" /> <xsl:text>|</xsl:text> <xsl:sequence select="$char-range" /> <xsl:text>)+</xsl:text> \] </xsl:variable> <!-- a character class is an escaped meta char, a char class expr or a wildcard expr --> <xsl:variable name="char-class"> <xsl:text>(</xsl:text> <xsl:sequence select="$char-class-escape" /> <xsl:text>|</xsl:text> <xsl:sequence select="$char-class-expr" /> <xsl:text>)</xsl:text> </xsl:variable> <!-- a char or an escaped meta char --> <xsl:variable name="atom" as="xs:string+"> <xsl:text>(</xsl:text> <xsl:sequence select="$char" /> <xsl:text>|</xsl:text> <xsl:sequence select="$char-class" /> <xsl:text>)</xsl:text> </xsl:variable> <!-- a piece has an atom and optional quantifier --> <xsl:variable name="piece"> <xsl:text>(</xsl:text> <xsl:sequence select="$atom" /> <xsl:text>)</xsl:text> <xsl:text>(</xsl:text> <xsl:sequence select="$quantifier" /> <xsl:text>)?</xsl:text> </xsl:variable> <!-- a branch has zero or more pieces (here: 1 or more, for optimizing the matching) --> <xsl:variable name="branch"> <xsl:text>(</xsl:text> <xsl:sequence select="$piece" /> <xsl:text>)+</xsl:text> </xsl:variable> <!-- a regexp has one or more branches --> <xsl:variable name="regexp" > <xsl:text>^$|^</xsl:text> <xsl:sequence select="$branch" /> <xsl:text>(\|</xsl:text> <xsl:sequence select="$branch" /> <xsl:text>)*</xsl:text> <xsl:text>$</xsl:text> </xsl:variable> <xsl:sequence select="matches($re, $regexp, 'x')" /> </xsl:function> <xsl:function name="re:is-regex" as="xs:boolean"> <xsl:param name="re" as="xs:string" /> <!-- variable with xs:boolean and xs:string results, the first contains results of inner regexes (inside parentheses) the second the remaining parenthesized string, which will recursively be re-applied --> <xsl:variable name="result-seq" as="xs:anyAtomicType*"> <!-- remove the escaped parentheses before applying regex --> <xsl:analyze-string select="replace($re, '\\\(|\\\)', '')" regex="\(([^()]*)\)({$quantifier})?" flags="x"> <xsl:matching-substring> <!-- test this deepest level --> <xsl:sequence select="re:test-partial-regex(regex-group(1))" /> </xsl:matching-substring> <xsl:non-matching-substring> <xsl:sequence select="xs:string(.)" /> </xsl:non-matching-substring> </xsl:analyze-string> </xsl:variable> <xsl:sequence select=" (: if a non-regex partial match is found, exit early :) if(some $b in $result-seq satisfies $b instance of xs:boolean and xs:boolean($b) = false()) then false() else (: last string, no parentheses anymore :) if(count($result-seq) = 1 and $result-seq[1] instance of xs:string) then re:test-partial-regex($result-seq[1]) else (: more sub regexes available still, recursively apply :) if($result-seq[. instance of xs:string][1]) then re:is-regex(string-join($result-seq[not(. instance of xs:boolean)], '')) else every $b in $result-seq satisfies xs:boolean($b) = true()" /> </xsl:function> </xsl:stylesheet>
|

Cart



