The XML Schema Complete Reference
XML Schema - Regular Expressions
A regular expression is a pattern for identifying a range of string
values. This pattern conforms to a specific grammar. The Schema Recommendation
suggests that an XML validator should implement "Level 1" regular expressions
as defined in the Unicode
Regular Expression Guidelines.
In this text, the term expression (without "regular") indicates a
regular expression snippet, or a subset of a regular expression. An expression
may match one or many characters. An expression may comprise an entire regular
expression.
XML Schema regular expressions are similar to other well-known regular
expressions, such as might be found in UNIX or Perl.
Examples:
- An XML schema that contains lots of regular expression examples.
- A corresponding XML instance that contains one or more examples of each regular expression in the XML schema.
The following tables contain information about all of the special regular expression characters:
| . | Match any character as defined by The Unicode Standard. | a.c | "aXc" "a9c" |
| \ | Precedes a metacharacter (to specify that character) or specifies a single- or multiple-character escape sequence. | \*\d*\* | "*1234*" |
| ? | Zero or one occurrences. | ab?c | "ac" "abc" |
| * | Zero or more occurrences. | ab*c | "ac" "abc" "abbbbbc" |
| + | One or more occurrences. | ab+c | "abc" "abbbbbc" |
| | | The "or" operator. | ab|cd | "ab" "cd" |
| ( | Start grouping. | a(b|c)d | "abd" "acd" |
| ) | End grouping. | a(b|c)d | "abd" "acd" |
| [ | Start range. | xx[A-Z]*xx | "xxABCDxx" |
| ] | End range. | xx[A-Z]*xx | "xxABCDxx" |
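The examples in the table above can be checked with a short script. This is only a sketch: Python's `re` module is a different regular expression dialect from the one in the Schema Recommendation, but these particular patterns use only syntax the two share. An XML Schema pattern must match the whole value, so the sketch uses `re.fullmatch` rather than `re.search`. The name `matches_whole_value` is a hypothetical helper, not part of any schema API.

```python
import re

# (pattern, values the table lists as matches), copied from the table above.
TABLE_EXAMPLES = [
    (r"a.c", ["aXc", "a9c"]),
    (r"\*\d*\*", ["*1234*"]),
    (r"ab?c", ["ac", "abc"]),
    (r"ab*c", ["ac", "abc", "abbbbbc"]),
    (r"ab+c", ["abc", "abbbbbc"]),
    (r"ab|cd", ["ab", "cd"]),
    (r"a(b|c)d", ["abd", "acd"]),
    (r"xx[A-Z]*xx", ["xxABCDxx"]),
]


def matches_whole_value(pattern: str, value: str) -> bool:
    # fullmatch emulates the implicit anchoring of a schema pattern:
    # the expression has to cover the entire value, not a substring.
    return re.fullmatch(pattern, value) is not None


# True only if every example from the table matches its pattern.
all_examples_match = all(
    matches_whole_value(pattern, value)
    for pattern, values in TABLE_EXAMPLES
    for value in values
)
```

Because of the whole-value rule, `matches_whole_value(r"a.c", "xaXcx")` is false even though the value contains a matching substring.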
Most characters that one normally types on a keyboard are literal: a regular expression such as 'qwerty' matches exactly those characters. The rest of
this page describes special escape sequences and more. Be aware that negating
a range of "normal" characters might produce surprising
results. For example, the regular expression '[^A-Z]' matches, among other things, a Greek or Japanese letter.
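The surprise with negated ranges is easy to reproduce. The following sketch uses Python's `re` module, which treats this simple negated range the same way; the helper name `is_not_a_to_z` is hypothetical.

```python
import re

# '[^A-Z]' excludes only the 26 uppercase ASCII letters; every other
# character, from any script, is accepted.
NOT_A_TO_Z = re.compile(r"[^A-Z]")


def is_not_a_to_z(ch: str) -> bool:
    # True if the single character ch is matched by [^A-Z].
    return NOT_A_TO_Z.fullmatch(ch) is not None


greek_matches = is_not_a_to_z("\u03bb")      # Greek small letter lambda
japanese_matches = is_not_a_to_z("\u3042")   # Hiragana letter a
uppercase_matches = is_not_a_to_z("Q")       # inside the excluded range
```

If the intent is "any character that is not an uppercase letter in any script", the exclusive character category '\P{Lu}' described later on this page is usually closer to what is wanted.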
| . | Any character except '\n' (newline) and '\r' (return). |
| \s | Whitespace, specifically ' ' (space), '\t' (tab), '\n' (newline) and '\r' (return). |
| \S | Any character except those matched by '\s'. |
| \i | The first character in an XML identifier. Specifically, any letter, the character '_', or the character ':'. See the XML Recommendation for the complex specification of a letter. The characters matched by '\i' are a subset of those matched by '\c'. |
| \I | Any character except those matched by '\i'. |
| \c | Any character that might appear in the built-in NMTOKEN datatype. See the XML Recommendation for the complex specification of a NameChar. |
| \C | Any character except those matched by '\c'. |
| \d | Any decimal digit. A shortcut for '\p{Nd}'. |
| \D | Any character except those matched by '\d'. |
| \w | Any character that might appear in a word. A shortcut for '[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]' (all characters except the set of "punctuation", "separator", and "other" characters). |
| \W | Any character except those matched by '\w'. |
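Several of these escapes look like their Perl or Python counterparts but are defined differently. The sketch below restates the definitions of '\s', '\d', and '\w' from the table using Python's `unicodedata` module rather than a regular expression engine. The function names are hypothetical, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# The table lists exactly four whitespace characters for \s.
XSD_WHITESPACE = {" ", "\t", "\n", "\r"}


def is_xsd_space(ch: str) -> bool:
    # \s: space, tab, newline, or return, and nothing else.
    return ch in XSD_WHITESPACE


def is_xsd_digit(ch: str) -> bool:
    # \d is a shortcut for \p{Nd}: general category "Number, Decimal Digit".
    return unicodedata.category(ch) == "Nd"


def is_xsd_word_char(ch: str) -> bool:
    # \w: every character except punctuation (P*), separators (Z*),
    # and "other" (C*) characters.
    return unicodedata.category(ch)[0] not in "PZC"
```

One consequence of this definition: the underscore has category Pc (connector punctuation), so `is_xsd_word_char("_")` is false, whereas Python's own `\w` accepts it. Symbols such as '$' (category Sc) are accepted.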
A regular expression can match a character by using a character category. The expression can be inclusive or exclusive of the character category. A regular expression must escape a character category. An inclusive character category that represents any uppercase letter looks like the following:
\p{Lu}
An exclusive category that represents any character except an uppercase letter looks like the following:
\P{Lu}
Note that inclusive requires a lowercase 'p', whereas exclusive requires an uppercase 'P'.
| L | Letter, Any | |
| Lu | Letter, Uppercase | |
| Ll | Letter, Lowercase | |
| Lt | Letter, Titlecase | |
| Lm | Letter, Modifier | |
| Lo | Letter, Other | |
| L& | Letter, uppercase, lowercase, and titlecase letters (Lu, Ll, and Lt) | Optional in The Unicode Standard; not supported by the Schema Recommendation. |
| M | Mark, Any | |
| Mn | Mark, Nonspacing | |
| Mc | Mark, Spacing Combining | |
| Me | Mark, Enclosing | |
| N | Number, Any | |
| Nd | Number, Decimal Digit | |
| Nl | Number, Letter | |
| No | Number, Other | |
| P | Punctuation, Any | |
| Pc | Punctuation, Connector | |
| Pd | Punctuation, Dash | |
| Ps | Punctuation, Open | |
| Pe | Punctuation, Close | |
| Pi | Punctuation, Initial quote (may behave like Ps or Pe, depending on usage) | |
| Pf | Punctuation, Final quote (may behave like Ps or Pe, depending on usage) | |
| Po | Punctuation, Other | |
| S | Symbol, Any | |
| Sm | Symbol, Math | |
| Sc | Symbol, Currency | |
| Sk | Symbol, Modifier | |
| So | Symbol, Other | |
| Z | Separator, Any | |
| Zs | Separator, Space | |
| Zl | Separator, Line | |
| Zp | Separator, Paragraph | |
| C | Other, Any | |
| Cc | Other, Control | |
| Cf | Other, Format | |
| Cs | Other, Surrogate | Explicitly not supported by the Schema Recommendation. |
| Co | Other, Private Use | |
| Cn | Other, Not Assigned (no characters in the file have this property). | |
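The category codes in the table above are the Unicode general categories, which Python exposes through `unicodedata.category`. Python's built-in `re` module does not accept '\p{...}', so the following sketch emulates the inclusive and exclusive forms with plain functions. The names `in_category` and `not_in_category` are hypothetical, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def in_category(ch: str, code: str) -> bool:
    # Emulates \p{code}. A one-letter code such as "L" covers every
    # subcategory that starts with that letter (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith(code)


def not_in_category(ch: str, code: str) -> bool:
    # Emulates \P{code}: the complement of \p{code}.
    return not in_category(ch, code)


upper_a_is_lu = in_category("A", "Lu")
lower_a_is_lu = in_category("a", "Lu")
lower_a_is_letter = in_category("a", "L")
```

For example, `in_category("\u20ac", "Sc")` holds for the euro sign, and `not_in_category("a", "Lu")` holds because 'a' is a lowercase letter.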
The Unicode Standard supports character blocks. A block is a range of characters set aside for a specific purpose. Some examples of these blocks are the characters for a language (such as Greek), the Braille character set, and various drawing symbols.
The XML Schema Recommendation provides a regular expression mechanism for
identifying characters that belong to a specific block of interest. The syntax
for identifying a block is '\p{IsBlockName}', where 'BlockName' is a name from Table 14.13. Like the character categories, an uppercase 'P' (as in '\P{IsBlockName}') excludes the characters in that block.
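A block test reduces to a code point range check. Python's standard library has no block lookup, so the sketch below hard-codes three ranges as an illustration. The dictionary `BLOCK_RANGES` and the function `in_block` are hypothetical, and the table is a tiny excerpt: a real validator carries the full list of block names and ranges.

```python
# Excerpt only: block name (without the "Is" prefix) -> (first, last)
# code point, inclusive.
BLOCK_RANGES = {
    "BasicLatin": (0x0000, 0x007F),
    "Greek": (0x0370, 0x03FF),
    "BraillePatterns": (0x2800, 0x28FF),
}


def in_block(ch: str, block_name: str) -> bool:
    # Emulates \p{IsBlockName} for a single character.
    first, last = BLOCK_RANGES[block_name]
    return first <= ord(ch) <= last


def not_in_block(ch: str, block_name: str) -> bool:
    # Emulates \P{IsBlockName}: the complement of the block.
    return not in_block(ch, block_name)
```

Note that a block is a contiguous range and may contain unassigned code points or characters from more than one category, so a block test is not interchangeable with a category test.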
An expression may match a character by using the common XML character
reference, which is a decimal number delimited by '&#' and ';', or a hex number
delimited by '&#x' and ';'. For example, the uppercase letter 'Z' is referenced
by the decimal representation '&#90;' and the hex representation
'&#x5A;'. These numbers correspond directly to the characters documented in
The Unicode Standard.
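Character references are resolved by the XML parser before the regular expression is ever evaluated, which is why they work inside a pattern. The following sketch uses Python's standard `xml.etree.ElementTree` parser to show that the decimal and hex forms of 'Z' decode to the same character. The element name `value` is arbitrary.

```python
import xml.etree.ElementTree as ET

# One decimal and one hex reference to the same code point, U+005A.
element = ET.fromstring("<value>&#90;&#x5A;</value>")

# The parser has already replaced both references with the character.
decoded = element.text

# The numbers in the references are simply the Unicode code point.
code_point = ord("Z")
```

After parsing, `decoded` is the two-character string "ZZ" and `code_point` is 90, which is 5A in hexadecimal.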