A regular expression R is a sequence of characters that denotes a set of strings L(R). When used to constrain a lexical space, a regular expression R asserts that only strings in L(R) are valid literals for values of that type.

A regular expression is composed from zero or more branches, separated by | characters.

Regular Expression

F regExp ::= branch ( '|' branch )*

For all branches S, and for all regular expressions T, valid regular expressions R are:	Denoting the set of strings L(R) containing:
(empty string)	the set containing just the empty string
S	all strings in L(S)
S|T	all strings in L(S) and all strings in L(T)
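
The following non-normative sketch models small, finite cases of L(R) as Python sets, so that alternation becomes set union. The function name `alternation` is hypothetical and is not part of this specification.

```python
def alternation(first, *rest):
    """Return L(S|T|...) as the union of the given branch languages.

    Each argument is a finite set of strings standing in for L(branch).
    At least one branch is required, matching the regExp production.
    """
    result = set(first)
    for language in rest:
        result |= set(language)
    return result


# L(a|bc), where each branch denotes a single string
language = alternation({"a"}, {"bc"})
```

Here `language` holds the strings "a" and "bc". An empty branch would be passed as `{""}`, the set containing just the empty string.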

A branch consists of zero or more pieces, concatenated together.

Branch

F branch ::= piece*

For all pieces S, and for all branches T, valid branches R are:	Denoting the set of strings L(R) containing:
S	all strings in L(S)
ST	all strings st with s in L(S) and t in L(T)
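
As a non-normative illustration, concatenation of pieces can be sketched over finite sets of strings. The name `concatenation` is hypothetical. Treating a branch of zero pieces as denoting only the empty string is an assumption consistent with the branch production.

```python
def concatenation(*piece_languages):
    """Return L(S1 S2 ... Sk) for finite piece languages.

    Every result is a string s1 + s2 + ... + sk with each si taken
    from the corresponding piece language.
    """
    result = {""}  # assumption: zero pieces denote just the empty string
    for language in piece_languages:
        result = {s + t for s in result for t in language}
    return result


# L(ST) with L(S) = {"a", "b"} and L(T) = {"c"}
language = concatenation({"a", "b"}, {"c"})
```

Here `language` holds "ac" and "bc", that is, every string st with s in L(S) and t in L(T).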

A piece is an atom, possibly followed by a quantifier.

Piece

F piece ::= atom quantifier?

For all atoms S and non-negative integers n, m such that n <= m, valid pieces R are:	Denoting the set of strings L(R) containing:
S	all strings in L(S)
S?	the empty string, and all strings in L(S)
S*	all strings in L(S?) and all strings st with s in L(S*) and t in L(S) (all concatenations of zero or more strings from L(S))
S+	all strings st with s in L(S) and t in L(S*) (all concatenations of one or more strings from L(S))
S{n,m}	all strings st with s in L(S) and t in L(S{n-1,m-1}) (all sequences of at least n, and at most m, strings from L(S))
S{n}	all strings in L(S{n,n}) (all sequences of exactly n strings from L(S))
S{n,}	all strings in L(S{n}S*) (all sequences of at least n strings from L(S))
S{0,m}	all strings st with s in L(S?) and t in L(S{0,m-1}) (all sequences of at most m strings from L(S))
S{0,0}	the set containing only the empty string
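
The recursive definitions of the bounded quantifiers in the table above can be traced with a short non-normative sketch over finite sets of strings. The helper names `concat`, `optional` and `repeat` are hypothetical. The unbounded forms S*, S+ and S{n,} denote infinite sets in general and are not covered here.

```python
def concat(left, right):
    """All strings st with s in left and t in right."""
    return {s + t for s in left for t in right}


def optional(language):
    """L(S?): the empty string, and all strings in L(S)."""
    return {""} | set(language)


def repeat(language, n, m):
    """L(S{n,m}) for non-negative integers n <= m, following the table rows."""
    if not 0 <= n <= m:
        raise ValueError("requires 0 <= n <= m")
    if m == 0:
        return {""}  # row S{0,0}
    if n == 0:
        # row S{0,m}: s in L(S?) and t in L(S{0,m-1})
        return concat(optional(language), repeat(language, 0, m - 1))
    # row S{n,m}: s in L(S) and t in L(S{n-1,m-1})
    return concat(set(language), repeat(language, n - 1, m - 1))


def exactly(language, n):
    """L(S{n}), defined as L(S{n,n})."""
    return repeat(language, n, n)
```

For example, `repeat({"a"}, 1, 2)` unfolds to the concatenation of L(S) with L(S{0,1}), giving the strings "a" and "aa".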

NOTE:
The regular expression language in the Perl Programming Language [Perl] does not include a quantifier of the form S{,m}, since it is logically equivalent to S{0,m}. We have, therefore, left this logical possibility out of the regular expression language defined by this specification. We welcome further input from implementors and schema authors on this issue.

A quantifier is one of ?, *, +, {n,m}, {n} or {n,}, which have the meanings defined in the table above.

Quantifier

F	`quantifier`	::=	`[?*+] | ( '{' quantity '}' )`
F	`quantity`	::=	`quantRange | quantMin | QuantExact`
F	`quantRange`	::=	`QuantExact ',' QuantExact`
F	`quantMin`	::=	`QuantExact ','`
F	`QuantExact`	::=	`[0-9]+`
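
The quantifier productions above can be exercised with a small non-normative recognizer. The function name `parse_quantifier` and the `(minimum, maximum)` return shape are assumptions made for this sketch. Python's `re` module is used only to recognize the quantifier syntax itself; it is not an implementation of the regular expression language defined here.

```python
import re

# One of ? * + or a braced quantity: {n}, {n,} or {n,m}
_QUANTIFIER = re.compile(r"([?*+])|\{([0-9]+)(?:(,)([0-9]*))?\}")

_SYMBOL_BOUNDS = {"?": (0, 1), "*": (0, None), "+": (1, None)}


def parse_quantifier(text):
    """Return (minimum, maximum); a maximum of None means unbounded."""
    match = _QUANTIFIER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a quantifier: {text!r}")
    symbol, low, comma, high = match.groups()
    if symbol is not None:
        return _SYMBOL_BOUNDS[symbol]
    minimum = int(low)
    if comma is None:
        return (minimum, minimum)  # QuantExact: {n}
    if high == "":
        return (minimum, None)     # quantMin: {n,}
    maximum = int(high)            # quantRange: {n,m}
    if minimum > maximum:
        raise ValueError("quantRange requires n <= m")
    return (minimum, maximum)
```

A form such as `{,3}` is rejected, because the quantity production always begins with QuantExact.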

An atom is either a normal character, a character class, or a parenthesized regular expression.

Atom

F atom ::= Char | charClass | ( '(' regExp ')' )

For all normal characters c, character classes C, and regular expressions S, valid atoms R are:	Denoting the set of strings L(R) containing:
c	the single string consisting only of c
C	all strings in L(C)
(S)	all strings in L(S)

A metacharacter is either ., \, ?, *, +, {, }, (, ), |, [ or ]. These characters have special meanings in regular expressions, but can be escaped to form atoms that denote the sets of strings containing only themselves, i.e., an escaped metacharacter behaves like a normal character.

A normal character is any XML character that is not a metacharacter. In regular expressions, a normal character is an atom that denotes the singleton set of strings containing only itself.

Normal Character

F Char ::= [^.\?*+()|#x5B#x5D]

Note that a normal character can be represented either as itself, or with a [character reference].
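
The distinction between metacharacters and normal characters can be sketched as follows. This is a non-normative illustration: the names `METACHARACTERS`, `is_normal_char` and `escape_literal` are hypothetical, the use of a preceding backslash as the escape is an assumption not spelled out in this section, and the check that a character is a legal XML character is omitted.

```python
# Characters given a special meaning above. "|" is included because the
# Char production excludes it; "{" and "}" because the prose lists them.
METACHARACTERS = set(".\\?*+{}()[]|")


def is_normal_char(ch):
    """True if ch is a single character that is not a metacharacter."""
    return len(ch) == 1 and ch not in METACHARACTERS


def escape_literal(text):
    """Escape each metacharacter so the result denotes only `text` itself.

    Assumption: a metacharacter is escaped by a preceding backslash.
    """
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)
```

For instance, `escape_literal("1+1=2")` puts a backslash before the `+` and leaves the other characters unchanged, so that the `+` is read as a literal character rather than as a quantifier.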
