F Regular Expressions
A regular expression R is a sequence of
characters that denotes a set of strings L(R).
When used to constrain a lexical space, a
regular expression R asserts that only strings
in L(R) are valid literals for values of that type.
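The membership test described above can be sketched in Python. This is an illustration only: Python's `re` dialect is not the regular expression language defined in this appendix, and the sketch assumes the two agree for the simple pattern used. The helper name `is_valid_literal` is hypothetical.

```python
import re

def is_valid_literal(pattern, literal):
    """Return True if the whole literal belongs to L(pattern)."""
    # fullmatch requires the entire string to match, which mirrors
    # membership in L(R) rather than a substring search.
    return re.fullmatch(pattern, literal) is not None

print(is_valid_literal(r"[0-9]{3}-[0-9]{4}", "555-1234"))   # True
print(is_valid_literal(r"[0-9]{3}-[0-9]{4}", "x555-1234"))  # False
```

Using `re.search` instead would accept any literal that merely contains a matching substring, which is not what constraining a lexical space means.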
A
regular expression is composed from zero or more
branches, separated by | characters.
Regular Expression
| For all branches S, and for all regular expressions T, valid regular expressions R are: | Denoting the set of strings L(R) containing: |
| (empty string) | the set containing just the empty string |
| S | all strings in L(S) |
| S|T | all strings in L(S) and all strings in L(T) |
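The rows of this table can be modelled with finite Python sets standing in for L(S) and L(T). This is a minimal sketch; the names `EMPTY_REGEX_LANG` and `alternation` are hypothetical and not part of this specification.

```python
# L(empty regular expression): the set containing just the empty string.
EMPTY_REGEX_LANG = {""}

def alternation(lang_s, lang_t):
    """L(S|T): every string in L(S) together with every string in L(T)."""
    return set(lang_s) | set(lang_t)

lang_s = {"cat", "dog"}
lang_t = {"dog", "bird"}
print(sorted(alternation(lang_s, lang_t)))  # ['bird', 'cat', 'dog']
```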
A branch consists
of zero or more pieces, concatenated together.
Branch
| For all pieces S, and for all branches T, valid branches R are: | Denoting the set of strings L(R) containing: |
| S | all strings in L(S) |
| ST | all strings st with s in L(S) and t in L(T) |
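Concatenation of a piece and a branch can be sketched the same way, again with finite sets standing in for the languages. The function name `concatenation` is hypothetical.

```python
def concatenation(lang_s, lang_t):
    """L(ST): every string st with s in L(S) and t in L(T)."""
    return {s + t for s in lang_s for t in lang_t}

print(sorted(concatenation({"a", "b"}, {"x", "y"})))  # ['ax', 'ay', 'bx', 'by']
```

Concatenating with the set containing only the empty string leaves a language unchanged, which is why a branch of zero pieces denotes just the empty string.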
A piece is an
atom, possibly followed by a
quantifier.
Piece
| For all atoms S and non-negative integers n, m such that n <= m, valid pieces R are: | Denoting the set of strings L(R) containing: |
| S | all strings in L(S) |
| S? | the empty string, and all strings in L(S) |
| S* | all strings in L(S?) and all strings st with s in L(S*) and t in L(S) (all concatenations of zero or more strings from L(S)) |
| S+ | all strings st with s in L(S) and t in L(S*) (all concatenations of one or more strings from L(S)) |
| S{n,m} | all strings st with s in L(S) and t in L(S{n-1,m-1}) (all sequences of at least n, and at most m, strings from L(S)) |
| S{n} | all strings in L(S{n,n}) (all sequences of exactly n strings from L(S)) |
| S{n,} | all strings in L(S{n}S*) (all sequences of at least n strings from L(S)) |
| S{0,m} | all strings st with s in L(S?) and t in L(S{0,m-1}) (all sequences of at most m strings from L(S)) |
| S{0,0} | the set containing only the empty string |
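The recursive rows for S{n,m}, S{0,m} and S{0,0} can be traced with a small sketch over finite sets. The helper names `optional`, `concat` and `repeat` are hypothetical; S* and S+ are left out because they denote infinite sets whenever L(S) contains a non-empty string.

```python
def optional(lang):
    """L(S?): the empty string plus every string in L(S)."""
    return {""} | set(lang)

def concat(lang_a, lang_b):
    """Every string st with s in lang_a and t in lang_b."""
    return {s + t for s in lang_a for t in lang_b}

def repeat(lang, n, m):
    """L(S{n,m}) for 0 <= n <= m, following the table's recursive rows."""
    if n == 0 and m == 0:
        return {""}                                    # row S{0,0}
    if n == 0:
        # row S{0,m}: s in L(S?) followed by t in L(S{0,m-1})
        return concat(optional(lang), repeat(lang, 0, m - 1))
    # row S{n,m}: s in L(S) followed by t in L(S{n-1,m-1})
    return concat(set(lang), repeat(lang, n - 1, m - 1))

print(sorted(repeat({"a", "b"}, 1, 2)))  # ['a', 'aa', 'ab', 'b', 'ba', 'bb']
```

The row for S{n} then corresponds to `repeat(lang, n, n)`.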
NOTE:
The regular expression language in the Perl Programming Language
[Perl] does not include a quantifier of the form
S{,m}, since it is logically equivalent to S{0,m}.
We have, therefore, left this logical possibility out of the regular
expression language defined by this specification. We welcome
further input from implementors and schema authors on this issue.
A quantifier
is one of ?, *, +,
{n,m} or {n,}, which have the meanings
defined in the table above.
Quantifier
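The quantifiers listed above can be exercised with Python's `re` module, assuming (as holds for these simple patterns) that Python gives them the same meaning as this appendix. The table `CASES` and the helper `accepts` are hypothetical names used only for illustration.

```python
import re

def accepts(pattern, text):
    """True if the whole of text is in the language of pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, candidate string, expected membership)
CASES = [
    ("a?", "", True), ("a?", "aa", False),
    ("a*", "", True), ("a*", "aaaa", True),
    ("a+", "", False), ("a+", "aaa", True),
    ("a{2,3}", "a", False), ("a{2,3}", "aaa", True), ("a{2,3}", "aaaa", False),
    ("a{2,}", "a", False), ("a{2,}", "aaaaa", True),
]

for pattern, text, expected in CASES:
    assert accepts(pattern, text) == expected
```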
An atom is either a
normal character, a character class, or
a parenthesized regular expression.
Atom
| For all normal characters c, character classes C, and regular expressions S, valid atoms R are: | Denoting the set of strings L(R) containing: |
| c | the single string consisting only of c |
| C | all strings in L(C) |
| (S) | all strings in L(S) |
A metacharacter
is one of ., \, ?,
*, +, {, },
(, ), [ or ].
These characters have special meanings in regular expressions,
but can be escaped to form atoms that denote the
sets of strings containing only themselves, i.e., an escaped
metacharacter behaves like a normal character.
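Escaping can be checked mechanically. The sketch below uses Python's `re` module, which happens to accept a backslash before each of the metacharacters listed above; whether other escape sequences coincide with this specification is not assumed. `METACHARACTERS` and `escaped_atom` are hypothetical names.

```python
import re

# The metacharacters as listed in the prose above.
METACHARACTERS = ".\\?*+{}()[]"

def escaped_atom(ch):
    """Build the escaped form of a metacharacter: a backslash followed by ch."""
    return "\\" + ch

for ch in METACHARACTERS:
    # The escaped metacharacter denotes the set containing only itself.
    assert re.fullmatch(escaped_atom(ch), ch) is not None
    assert re.fullmatch(escaped_atom(ch), "a") is None

# Unescaped, "." is not a normal character: it also matches other characters.
assert re.fullmatch(".", "a") is not None
```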
A
normal character is any XML character that is not a
metacharacter. In regular expressions, a normal character is an
atom that denotes the singleton set of strings containing only itself.
Normal Character
| F | Char | ::= | [^.\?*+()|#x5B#x5D] |
Note that a normal character can be represented either as
itself or with a character reference.
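As a rough illustration, the Char production can be transcribed into a Python character class, and a character reference can be resolved with the standard XML parser before the pattern is used. Both `NORMAL_CHAR` and `is_normal_char` are hypothetical names, and the `<pattern>` element is only a stand-in container, not a construct defined here.

```python
import re
import xml.etree.ElementTree as ET

# Char ::= [^.\?*+()|#x5B#x5D], rewritten in Python syntax
# (#x5B is "[" and #x5D is "]").
NORMAL_CHAR = re.compile(r"[^.\\?*+()|\[\]]")

def is_normal_char(ch):
    """True if ch is a single character allowed by the Char production."""
    return len(ch) == 1 and NORMAL_CHAR.fullmatch(ch) is not None

print(is_normal_char("a"))  # True
print(is_normal_char("."))  # False

# The XML parser resolves the character reference &#x61; to "a",
# so the regular expression seen by the processor is "ab".
pattern = ET.fromstring("<pattern>&#x61;b</pattern>").text
print(pattern)                               # ab
print(re.fullmatch(pattern, "ab") is not None)  # True
```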