Subject: Re: Need an XPath expression which returns all xs:pattern elements containing a regex that permits an unbounded number of characters
From: "C. M. Sperberg-McQueen cmsmcq@xxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 4 Apr 2024 18:05:08 -0000
You seem to be close to a reasonably good solution already.

Unless I'm missing something, you've identified the only three ways that
a regular expression can match an unbounded number of characters: the *
and + operators, and a quantifier with a comma but no second argument.
That's a good start, I think.  Either of the first two can be escaped
with a single backslash, and none of them has a meaning as a quantifier
within a square-bracketed character-class expression.

The simplest first approximation would be very like the one you have
already tried: search for '*' or '+' or ',}'.  (It's a mistake to search
only for '{1,}' or '{0,}', because an expression like 'a{4,}' also matches
strings of unbounded length; I am assuming you don't know in advance
that the only minimum values used in numeric quantifiers are 0 and 1.)

So something like:

    xs:pattern[matches(@value, "[*+]|,\}")]

As you have noticed, that pulls up false positives like '\*'.
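
A quick way to see both the hits and the false positive is to run the same regex over a few sample values.  The sketch below uses Python's re module as a stand-in for XPath's matches() (an assumption of convenience; for this small regex the two dialects should agree).  The helper name first_approx and the sample values are illustrative, not from the thread.

```python
import re

# First approximation: a star, a plus, or ",}" anywhere in the value.
# re.search is unanchored, like XPath's matches().
FIRST = re.compile(r"[*+]|,\}")

def first_approx(value: str) -> bool:
    return FIRST.search(value) is not None

# Hypothetical sample values of xs:pattern/@value.
samples = ["A*", "A+", "A{4,}", "A{2,5}", r"A\*"]
for s in samples:
    print(f"{s!r:10} flagged: {first_approx(s)}")
```

The bounded quantifier 'A{2,5}' is left alone, while the escaped star in 'A\*' is still flagged, which is the false positive mentioned above.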

A better approximation would be to search for any of:

  - '*' when not preceded by a backslash    
  - '+' when not preceded by a backslash
  - ',}'

I believe the string ",}" can appear in a legal XSD regular expression
only as part of a quantifier:  "\,}" would escape the comma, but the
right brace is not allowed without an escape, so an escaped form would
be ",\}", which won't match the string ",}".

So something like:

    xs:pattern[matches(@value, "[^\\][*+]|,\}")]

This second approximation will eliminate some false positives, but it
will still return a false positive on a pattern like "[?*+{,}]?", since
the characters of interest to us need not be escaped within a character
class expression.  It also will produce a false negative on "\\*", which
matches any number of backslash characters.
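
The three cases just described can be checked mechanically.  Again this is only a sketch, with Python's re standing in for XPath's matches() and with a helper name (second_approx) of my own choosing.

```python
import re

# Second approximation: star or plus preceded by a non-backslash, or ",}".
SECOND = re.compile(r"[^\\][*+]|,\}")

def second_approx(value: str) -> bool:
    return SECOND.search(value) is not None

cases = {
    r"A\*": "escaped star, should not be flagged",
    "[?*+{,}]?": "quantifier characters inside a character class",
    r"\\*": "escaped backslash followed by a real star",
}
for value, note in cases.items():
    print(f"{value!r:14} flagged: {second_approx(value)}  ({note})")
```

The first case is now handled correctly; the second is flagged although it is bounded, and the third is missed although it is unbounded.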

A third approximation would ensure that we don't match * or + after a
single backslash, or between (unescaped) left and right square brackets,
by first imagining a simple finite state automaton and then translating
it into a regular expression.

  - in the NORMAL state:
    . a star or plus takes us to state MATCH
    . a comma takes us to state COMMA
    . a backslash takes us to state ESC
    . a left bracket takes us to state LB
    . anything else leaves us in state NORMAL
  - in state COMMA
    . a right brace takes us to MATCH
    . anything else takes us to NORMAL
  - in state ESC
    . any character takes us to NORMAL
  - in state LB
    . a right bracket takes us to NORMAL
    . a backslash takes us to state LBESC
    . anything else leaves us in state LB
  - in state LBESC
    . any character takes us to state LB
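
The automaton above can be transcribed almost line for line.  Here is a minimal sketch in Python (chosen only because it is easy to run; the function name permits_unbounded is hypothetical), with MATCH treated as an accepting state that ends the scan.

```python
def permits_unbounded(value: str) -> bool:
    """Literal transcription of the five-state automaton described above."""
    state = "NORMAL"
    for ch in value:
        if state == "NORMAL":
            if ch in "*+":
                return True          # state MATCH
            elif ch == ",":
                state = "COMMA"
            elif ch == "\\":
                state = "ESC"
            elif ch == "[":
                state = "LB"
            # anything else: stay in NORMAL
        elif state == "COMMA":
            if ch == "}":
                return True          # state MATCH
            state = "NORMAL"
        elif state == "ESC":
            state = "NORMAL"         # the escaped character is skipped
        elif state == "LB":
            if ch == "]":
                state = "NORMAL"
            elif ch == "\\":
                state = "LBESC"
            # anything else: stay in LB
        elif state == "LBESC":
            state = "LB"
    return False

for v in ["A*", r"A\*", r"\\*", "[?*+{,}]?", "A{4,}", "A{2,5}"]:
    print(f"{v!r:14} flagged: {permits_unbounded(v)}")
```

Note that, transcribed literally, state COMMA swallows whatever character follows a comma, so a value like 'a,*' is not flagged; whether that matters depends on the schemas at hand.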

So: the regex should be anchored at the start of the string and allow
any number of ordinary characters and excursions to state COMMA, ESC, or
LB, followed by one of the strings we are looking for:

  "^([^*+,\\\[]|(,[^}])|(\\.)|(\[(\\.|[^\\\]])*\]))*([*+]|,\})"

Since character class expressions can nest in XSD, you can have
expressions like

  [\p{L}-[a-z]]

which means that in principle square brackets can nest arbitrarily deep,
and you would have to keep a stack in order to know reliably when you
get back to the normal state, outside of all square bracket pairs.  But
since a nested character class expression can occur only as the last
child of its parent, you don't need to keep track in practice:  as soon
as you see the first unescaped right bracket in state LB, you will in
any well formed expression see a series of right brackets. None of them
will match star, plus, or comma-right-brace, so there is no need to keep
a stack.
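
To check that a regex of this shape copes with escapes, character classes, and the nested case without a stack, it can be run against a handful of values.  This is a sketch under assumptions: Python's re stands in for XPath's matches(), the regex is an anchored variant that also consumes ordinary characters in the NORMAL state, and the name third_approx is mine.

```python
import re

# Anchored translation of the automaton:
#   [^*+,\\\[]          ordinary character in state NORMAL
#   ,[^}]               excursion to COMMA and back
#   \\.                 excursion to ESC and back
#   \[(\\.|[^\\\]])*\]  excursion to LB (with LBESC inside) and back
# followed by the MATCH condition: star, plus, or ",}".
THIRD = re.compile(r"^([^*+,\\\[]|,[^}]|\\.|\[(\\.|[^\\\]])*\])*([*+]|,\})")

def third_approx(value: str) -> bool:
    return THIRD.search(value) is not None

values = ["A*", r"A\*", r"\\*", "[?*+{,}]?", "A{4,}", r"[\p{L}-[a-z]]*"]
for v in values:
    print(f"{v!r:18} flagged: {third_approx(v)}")
```

In the nested example the first right bracket returns the match to the normal state and the second one is consumed as an ordinary character, so the trailing star is still found.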

Note, however, that matching braces in XPath is complicated by the fact
that they often have special meaning in XPath.  If you can find a good
explanation of the escaping rules, read it before you try to make the
expression above work.  (If it were me, I'd place a bet on the sequence
comma plus right brace never occurring within a character class
expression, and use regex matching to deal with the escaping of * and +
and just use contains() to look for occurrences of ",}".  Of course, if
it were me, the cost of a false positive here and there would be low --
your mileage may differ.)
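
The split suggested above (a regex for * and +, a plain substring test for ",}") might look like the following.  It is a sketch only: Python's "in" operator plays the role of contains(), re plays the role of matches(), and the names STAR_PLUS and flag_pattern are hypothetical.

```python
import re

# Unescaped star or plus outside a character class: skip ordinary
# characters, escaped pairs, and bracketed classes, then require * or +.
STAR_PLUS = re.compile(r"^([^*+\\\[]|\\.|\[(\\.|[^\\\]])*\])*[*+]")

def flag_pattern(value: str) -> bool:
    # The substring test avoids having to escape braces in the regex at all.
    return ",}" in value or STAR_PLUS.search(value) is not None

for v in ["A+", "A{4,}", r"A\*", "[*+]", "[?*+{,}]?"]:
    print(f"{v!r:14} flagged: {flag_pattern(v)}")
```

The last value is flagged only because ",}" occurs inside a character class, which is exactly the kind of occasional false positive this simpler approach accepts.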

I hope this helps.

"Roger L Costello costello@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> writes:

> Hi Folks,
>
> I want to find, in an XML Schema, all xs:pattern elements containing a regex that permits an unbounded number of characters.
>
> Here are examples of xs:pattern elements that I want to find:
>
> <xs:pattern value="A*"/>
> <xs:pattern value="A+"/>
> <xs:pattern value="A{0,}"/>
> <xs:pattern value="A{1,}"/>
>
> I do not want either of the following xs:pattern elements because -- due to the escape symbol -- they do not permit an unbounded number of characters:
>
> <xs:pattern value="A\*"/>
> <xs:pattern value="A\+"/>
>
> I created an XPath 2.0 expression to find the desired xs:pattern elements:
>
> xs:pattern[
>         contains(@value, '*') or 
>         contains(@value, '+') or 
>         contains(@value, '{1,}') or 
>         contains(@value, '{0,}')
>     ]
>
> Eek! That is not correct. It incorrectly returns the xs:pattern elements with escaped asterisk and escaped plus symbols:
>
> <xs:pattern value="A\*"/>
> <xs:pattern value="A\+"/>
>
> How to fix my XPath expression? Is the solution to add a second predicate:
>
> xs:pattern[
>         contains(@value, '*') or 
>         contains(@value, '+') or 
>         contains(@value, '{1,}') or 
>         contains(@value, '{0,}')
>     ][
>         not(contains(@value, '\*')) and
>         not(contains(@value, '\+'))
>     ]
>
> Is that correct? Is that the best approach? Is there a better approach?
>
> Bonus points if you can answer this question: Is my XPath expression catching all xs:pattern elements that have a regex that permits an unbounded number of characters?
>
> Note: For reasons that I will not explain, the XPath expression must be an XPath 2.0 expression.
>
> /Roger


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
