XML Schema Reference


XML Schema - Regular Expressions

A regular expression is a pattern for identifying a range of string values. This pattern conforms to a specific grammar. The Schema Recommendation suggests that an XML validator should implement "Level 1" regular expressions as defined in the Unicode Regular Expression Guidelines.

In this text, the term expression (without "regular") indicates a regular expression snippet, or a subset of a regular expression. An expression may match one or many characters. An expression may comprise an entire regular expression.

XML Schema regular expressions are similar to other well-known regular expressions, such as might be found in UNIX or Perl.


Examples:

  • An XML schema that contains lots of regular expression examples.
  • A corresponding XML instance that contains one or more examples of each regular expression in the XML schema.


XML Schema - Regular Expressions - Metacharacters

Metacharacter Description Regular Expression Sample Match
. Match any character as defined by The Unicode Standard. a.c "aXc", "a9c"
\ Precedes a metacharacter (to specify that character) or specifies a single- or multiple-character escape sequence. \*\d*\* "*1234*"
? Zero or one occurrences. ab?c "ac", "abc"
* Zero or more occurrences. ab*c "ac", "abc", "abbbbbc"
+ One or more occurrences. ab+c "abc", "abbbbbc"
| The "or" operator. ab|cd "ab", "cd"
( Start grouping. a(b|c)d "abd", "acd"
) End grouping. a(b|c)d "abd", "acd"
[ Start range. xx[A-Z]*xx "xxABCDxx"
] End range. xx[A-Z]*xx "xxABCDxx"
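
The sample matches in the table above can be checked with a short script. This is only a sketch: it uses Python's `re` module as a stand-in for a schema validator, which is an assumption, since Python's syntax is not identical to XML Schema's. An XML Schema pattern must match the whole value, so the sketch uses `re.fullmatch` rather than `re.search`. The names `SAMPLES` and `check_samples` are made up for illustration.

```python
import re

# (pattern, strings expected to match, strings expected NOT to match)
# Patterns and matching strings come from the metacharacter table;
# the non-matching strings are extra illustrations.
SAMPLES = [
    (r"a.c",        ["aXc", "a9c"],            ["ac"]),
    (r"\*\d*\*",    ["*1234*"],                ["1234"]),
    (r"ab?c",       ["ac", "abc"],             ["abbc"]),
    (r"ab*c",       ["ac", "abc", "abbbbbc"],  ["adc"]),
    (r"ab+c",       ["abc", "abbbbbc"],        ["ac"]),
    (r"ab|cd",      ["ab", "cd"],              ["abcd"]),
    (r"a(b|c)d",    ["abd", "acd"],            ["abcd"]),
    (r"xx[A-Z]*xx", ["xxABCDxx"],              ["xxabcdxx"]),
]

def check_samples(samples=SAMPLES):
    """Return the (pattern, string) pairs that behaved unexpectedly."""
    failures = []
    for pattern, should_match, should_not_match in samples:
        for s in should_match:
            if re.fullmatch(pattern, s) is None:
                failures.append((pattern, s))
        for s in should_not_match:
            if re.fullmatch(pattern, s) is not None:
                failures.append((pattern, s))
    return failures
```

An empty list from `check_samples()` means every row behaved as the table describes under Python's engine.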

XML Schema - Regular Expressions - Individual Characters

XML Schema - Regular Expressions - Normal Characters

A regular expression made up of the characters one normally types on a keyboard (e.g., 'qwerty') matches exactly those characters. The rest of this page describes special escape sequences and more. Be aware that negating a set of "normal" characters might provide surprising results. For example, the regular expression '[^A-Z]' matches, among other things, a Greek or Japanese letter.
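
The surprise with negation is easy to reproduce. The sketch below uses Python's `re` module as an approximation of a schema validator (an assumption; the two engines differ in other respects), and the helper name `is_single_non_upper_ascii` is hypothetical.

```python
import re

# '[^A-Z]' excludes only the 26 ASCII uppercase letters; every other
# character, including non-Latin uppercase letters, still matches.
NOT_UPPER_ASCII = re.compile(r"[^A-Z]")

def is_single_non_upper_ascii(value):
    """True if value is exactly one character outside the range A-Z."""
    return NOT_UPPER_ASCII.fullmatch(value) is not None
```

Under this sketch, 'Q' is rejected, but the Greek capital sigma, a Hiragana letter, and a digit are all accepted. If the intent is "any character that is not an uppercase letter", the category escape '\P{Lu}' is usually the better fit.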

XML Schema - Regular Expressions - Single Character Escape Sequence

Single Character Escape Sequence Description
\n New line character (&#xA;): line feed
\r Return character (&#xD;): carriage return
\t Tab character (&#x9;)
\\ \
\| |
\. .
\- -
\^ ^
\? ?
\* *
\+ +
\{ {
\} }
\( (
\) )
\[ [
\] ]
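
When a literal string has to be embedded in a pattern, each character from the table above needs a backslash in front of it. The following sketch shows one way to do that. The names `XSD_ESCAPABLE` and `escape_literal` are hypothetical, and Python's `re` module is used only to demonstrate the result.

```python
import re

# The characters that the table above lists as backslash-escapable
# (excluding the letter escapes \n, \r, and \t).
XSD_ESCAPABLE = set("\\|.-^?*+{}()[]")

def escape_literal(text):
    """Prefix every escapable character in text with a backslash."""
    return "".join("\\" + ch if ch in XSD_ESCAPABLE else ch for ch in text)

def matches_literally(text, value):
    """Check that the escaped form of text matches value exactly."""
    return re.fullmatch(escape_literal(text), value) is not None
```

For example, `escape_literal("(1+1)")` produces `\(1\+1\)`, which matches only the string "(1+1)". Without the escaping, the parentheses would form a group and the plus sign would act as a quantifier.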

XML Schema - Regular Expressions - Multiple Character Escape Sequences

Multiple Character Escape Sequences Description
. Any character except '\n' (newline) and '\r' (return).
\s Whitespace, specifically ' ' (space), '\t' (tab), '\n' (newline) and '\r' (return).
\S Any character except those matched by '\s'.
\i The first character in an XML identifier. Specifically, any letter, the character '_', or the character ':'. See the XML Recommendation for the complex specification of a letter. The characters matched by '\i' are a subset of the characters matched by '\c'.
\I Any character except those matched by '\i'.
\c Any character that might appear in the built-in NMTOKEN datatype. See the XML Recommendation for the complex specification of a NameChar.
\C Any character except those matched by '\c'.
\d Any decimal digit. A shortcut for '\p{Nd}'.
\D Any character except those matched by '\d'.
\w Any character that might appear in a word. A shortcut for '[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]' (all characters except the set of "punctuation", "separator", and "other" characters).
\W Any character except those matched by '\w'.
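
The definitions of '\d', '\w', and '\s' above differ from what many other regular expression engines do, so it can help to spell them out as predicates. The sketch below approximates them with Python's `unicodedata` module. The function names are hypothetical, and the results depend on the Unicode version bundled with the Python interpreter, which may not be the version a given validator uses.

```python
import unicodedata

def xsd_digit(ch):
    # \d is a shortcut for \p{Nd}: general category "Nd" only.
    return unicodedata.category(ch) == "Nd"

def xsd_word_char(ch):
    # \w is every character outside the P (punctuation),
    # Z (separator), and C (other) categories.
    return unicodedata.category(ch)[0] not in "PZC"

def xsd_space(ch):
    # \s is exactly these four characters and nothing else.
    return len(ch) == 1 and ch in " \t\n\r"
```

Two consequences are worth noting. The underscore has category Pc, so it is not a word character under this definition, while the dollar sign (category Sc) is. A no-break space (U+00A0) is not matched by '\s'.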

XML Schema - Regular Expressions - Character Categories

A regular expression can match a character by using a character category. The expression can be inclusive or exclusive of the character category. A regular expression must escape a character category. An inclusive character category that represents any uppercase letter looks like the following:

    \p{Lu}

An exclusive category that represents any character except an uppercase letter looks like the following:

    \P{Lu}

Note that inclusive requires a lowercase 'p', whereas exclusive requires an uppercase 'P'.

 

Character Category Description Notes
L Letter, Any  
Lu Letter, Uppercase  
Ll Letter, Lowercase  
Lt Letter, Titlecase  
Lm Letter, Modifier  
Lo Letter, Other  
L& Letter, uppercase, lowercase, and titlecase letters (Lu, Ll, and Lt) Optional in The Unicode Standard; not supported by the Schema Recommendation.
M Mark, Any  
Mn Mark, Nonspacing  
Mc Mark, Spacing Combining  
Me Mark, Enclosing  
N Number, Any  
Nd Number, Decimal Digit  
Nl Number, Letter  
No Number, Other  
P Punctuation, Any  
Pc Punctuation, Connector  
Pd Punctuation, Dash  
Ps Punctuation, Open  
Pe Punctuation, Close  
Pi Punctuation, Initial quote (may behave like Ps or Pe, depending on usage)  
Pf Punctuation, Final quote (may behave like Ps or Pe, depending on usage)  
Po Punctuation, Other  
S Symbol, Any  
Sm Symbol, Math  
Sc Symbol, Currency  
Sk Symbol, Modifier  
So Symbol, Other  
Z Separator, Any  
Zs Separator, Space  
Zl Separator, Line  
Zp Separator, Paragraph  
C Other, Any  
Cc Other, Control  
Cf Other, Format  
Cs Other, Surrogate Explicitly not supported by the Schema Recommendation.
Co Other, Private Use  
Cn Other, Not Assigned (no characters in the file have this property).  
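
Python's built-in `re` module does not accept '\p{...}', so the sketch below emulates inclusive and exclusive category tests with `unicodedata.category`. The function names are hypothetical. A one-letter category such as 'L' is treated as a prefix, which covers Lu, Ll, Lt, Lm, and Lo.

```python
import unicodedata

def in_category(ch, category):
    # Emulates \p{category}. A one-letter category ("L", "N", ...)
    # matches any two-letter category that starts with that letter.
    return unicodedata.category(ch).startswith(category)

def not_in_category(ch, category):
    # Emulates \P{category}, the exclusive form.
    return not in_category(ch, category)

def all_in_category(text, category):
    # Roughly what the pattern \p{category}* would accept.
    return all(in_category(ch, category) for ch in text)
```

For instance, `in_category("A", "Lu")` and `in_category("a", "L")` are both true, while `in_category("a", "Lu")` is false.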

XML Schema - Regular Expressions - Character Blocks

The Unicode Standard supports character blocks. A block is a range of characters set aside for a specific purpose. Some examples of these blocks are the characters for a language (such as Greek), the Braille character set, and various drawing symbols. The XML Schema Recommendation provides a regular expression mechanism for identifying characters that belong to a specific block of interest. The syntax for identifying a block is '\p{IsBlockName}', where 'BlockName' is a name from Table 14.13. Like the character categories, an uppercase 'P' (as in '\P{IsBlockName}') excludes the characters in that block.

Block Name Start Code End Code
BasicLatin #x0000 #x007F
Latin-1Supplement #x0080 #x00FF
LatinExtended-A #x0100 #x017F
LatinExtended-B #x0180 #x024F
IPAExtensions #x0250 #x02AF
SpacingModifierLetters #x02B0 #x02FF
CombiningDiacriticalMarks #x0300 #x036F
Greek #x0370 #x03FF
Cyrillic #x0400 #x04FF
Armenian #x0530 #x058F
Hebrew #x0590 #x05FF
Arabic #x0600 #x06FF
Syriac #x0700 #x074F
Thaana #x0780 #x07BF
Devanagari #x0900 #x097F
Bengali #x0980 #x09FF
Gurmukhi #x0A00 #x0A7F
Gujarati #x0A80 #x0AFF
Oriya #x0B00 #x0B7F
Tamil #x0B80 #x0BFF
Telugu #x0C00 #x0C7F
Kannada #x0C80 #x0CFF
Malayalam #x0D00 #x0D7F
Sinhala #x0D80 #x0DFF
Thai #x0E00 #x0E7F
Lao #x0E80 #x0EFF
Tibetan #x0F00 #x0FFF
Myanmar #x1000 #x109F
Georgian #x10A0 #x10FF
HangulJamo #x1100 #x11FF
Ethiopic #x1200 #x137F
Cherokee #x13A0 #x13FF
UnifiedCanadianAboriginalSyllabics #x1400 #x167F
Ogham #x1680 #x169F
Runic #x16A0 #x16FF
Khmer #x1780 #x17FF
Mongolian #x1800 #x18AF
LatinExtendedAdditional #x1E00 #x1EFF
GreekExtended #x1F00 #x1FFF
GeneralPunctuation #x2000 #x206F
SuperscriptsandSubscripts #x2070 #x209F
CurrencySymbols #x20A0 #x20CF
CombiningMarksforSymbols #x20D0 #x20FF
LetterlikeSymbols #x2100 #x214F
NumberForms #x2150 #x218F
Arrows #x2190 #x21FF
MathematicalOperators #x2200 #x22FF
MiscellaneousTechnical #x2300 #x23FF
ControlPictures #x2400 #x243F
OpticalCharacterRecognition #x2440 #x245F
EnclosedAlphanumerics #x2460 #x24FF
BoxDrawing #x2500 #x257F
BlockElements #x2580 #x259F
GeometricShapes #x25A0 #x25FF
MiscellaneousSymbols #x2600 #x26FF
Dingbats #x2700 #x27BF
BraillePatterns #x2800 #x28FF
CJKRadicalsSupplement #x2E80 #x2EFF
KangxiRadicals #x2F00 #x2FDF
IdeographicDescriptionCharacters #x2FF0 #x2FFF
CJKSymbolsandPunctuation #x3000 #x303F
Hiragana #x3040 #x309F
Katakana #x30A0 #x30FF
Bopomofo #x3100 #x312F
HangulCompatibilityJamo #x3130 #x318F
Kanbun #x3190 #x319F
BopomofoExtended #x31A0 #x31BF
EnclosedCJKLettersandMonths #x3200 #x32FF
CJKCompatibility #x3300 #x33FF
CJKUnifiedIdeographsExtensionA #x3400 #x4DB5
CJKUnifiedIdeographs #x4E00 #x9FFF
YiSyllables #xA000 #xA48F
YiRadicals #xA490 #xA4CF
HangulSyllables #xAC00 #xD7A3
HighSurrogates #xD800 #xDB7F
HighPrivateUseSurrogates #xDB80 #xDBFF
LowSurrogates #xDC00 #xDFFF
PrivateUse #xE000 #xF8FF
CJKCompatibilityIdeographs #xF900 #xFAFF
AlphabeticPresentationForms #xFB00 #xFB4F
ArabicPresentationForms-A #xFB50 #xFDFF
CombiningHalfMarks #xFE20 #xFE2F
CJKCompatibilityForms #xFE30 #xFE4F
SmallFormVariants #xFE50 #xFE6F
ArabicPresentationForms-B #xFE70 #xFEFE
Specials #xFEFF #xFEFF
HalfwidthandFullwidthForms #xFF00 #xFFEF
Specials #xFFF0 #xFFFD
OldItalic #x10300 #x1032F
Gothic #x10330 #x1034F
Deseret #x10400 #x1044F
ByzantineMusicalSymbols #x1D000 #x1D0FF
MusicalSymbols #x1D100 #x1D1FF
MathematicalAlphanumericSymbols #x1D400 #x1D7FF
CJKUnifiedIdeographsExtensionB #x20000 #x2A6D6
CJKCompatibilityIdeographsSupplement #x2F800 #x2FA1F
Tags #xE0000 #xE007F
PrivateUse #xF0000 #x10FFFD
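
A block test is a simple range check on the code point. The sketch below copies a handful of rows from the table above into a dictionary. The names `BLOCKS`, `in_block`, and `not_in_block` are hypothetical, and a complete implementation would need every row.

```python
# A few rows from the block table: name -> (start, end), inclusive.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "BraillePatterns": (0x2800, 0x28FF),
    "Hiragana": (0x3040, 0x309F),
}

def in_block(ch, block_name):
    # Emulates \p{IsBlockName} for a single character.
    start, end = BLOCKS[block_name]
    return start <= ord(ch) <= end

def not_in_block(ch, block_name):
    # Emulates \P{IsBlockName}, the exclusive form.
    return not in_block(ch, block_name)
```

With these definitions, the Greek letter alpha (U+03B1) is in 'Greek' and the letter 'a' is in 'BasicLatin' but not in 'Greek'.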

XML Schema - Regular Expressions - XML Character References

An expression may match a character by using the common XML character reference, which is a decimal number delimited by '&#' and ';', or a hex number delimited by '&#x' and ';'. For example, the uppercase letter 'Z' is referenced by the decimal representation '&#90;' and the hex representation '&#x5A;'. These numbers correspond directly to the characters documented in The Unicode Standard.
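
Because a schema is itself an XML document, the XML parser resolves character references before the regular expression engine sees the pattern. The sketch below illustrates this with Python's standard `xml.etree.ElementTree`. The element in `FRAGMENT` is a simplified, hypothetical stand-in for a pattern facet (no namespace, no surrounding schema). The letter 'Z' is code point 90 in decimal and 5A in hex.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: 'Z' written once as a decimal reference and
# once as a hex reference, followed by the '+' quantifier.
FRAGMENT = '<pattern value="&#90;&#x5A;+"/>'

def resolved_pattern(fragment=FRAGMENT):
    """Return the pattern text as it looks after XML parsing."""
    return ET.fromstring(fragment).get("value")

def value_matches(value, fragment=FRAGMENT):
    """Apply the resolved pattern to value, anchored at both ends."""
    return re.fullmatch(resolved_pattern(fragment), value) is not None
```

Here `resolved_pattern()` yields the plain text `ZZ+`, so the values "ZZ" and "ZZZ" match and "Z" does not.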