[XSL-LIST Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

Re: Need an XPath expression which returns all xs:patt

Subject: Re: Need an XPath expression which returns all xs:pattern elements containing a regex that permits an unbounded number of characters
From: "C. M. Sperberg-McQueen cmsmcq@xxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 5 Apr 2024 15:01:52 -0000
Re:  Need an XPath expression which returns all xs:patt
"Willem Van Lishout willemvanlishout@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> writes:

> Is this even possible, theoretically speaking? As soon as you start
> using lookaheads, square brackets, and so on, your patterns will
> likely fail. I don't think regex can parse regex.

Good observation, quite correct.  Since the language of regular
expressions in XSD (and XPath) is context free, not a regular language,
it cannot be recognized in full by a regular expression.

There are, I think, two ways to deal with that fact:

1 In the general case, if we limit the maximum nesting depth of a
context-free language to some finite number, the language becomes
regular.  So we cannot write a regular expression to match all and only
the legal regular expressions of XSD or XPath, but we *can* write one
that recognizes all and only the ones with at most one level of nesting
for parentheses or square brackets.  We can also write one that
recognizes all and only the expressions with at most two, or three, or
any positive n, levels of nesting.  (The result is not guaranteed to be
pretty.)  For any finite set of inputs, there will be some maximum
nesting level observed; if you have written your regular expression to
handle nesting that deep, your regex will produce neither false
positives nor false negatives for that input.

How comforting this is depends on how costly a false positive or
negative would be, and how costly it would be to parse the regular
expressions in question using something like ixml.  (Hmm.  Good idea for
a sample ixml grammar!)

2 In this case, we can observe that the requirement is only to detect
patterns which allow strings of unbounded length, and thus regular
expressions which use quantifiers like *, +, or {n,} on any base
expression.  We can assume that all inputs are syntactically well
formed, so we do not need to distinguish well-formed from ill-formed
expressions.

Nesting of parentheses does not matter and can be ignored.  Nesting of
square brackets turns out also not to matter, since the characters *, +,
{, and } are effectively escaped at every possible nesting level in a
character class expression, and once the first right bracket is
encountered, no character other than right bracket can be encountered
until all open square bracket pairs have been closed.


So: no, theoretically speaking, it's not possible to recognize regular
expressions with regular expressions.  But practically speaking, *for
this problem* (a) an approximation is possible, and in fact better than
that (b) a correct solution is possible, because of patterns in the
language in question which we can see and capture.

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Current Thread

PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.