/vendor/pcre/doc/html/pcrepattern.html

http://github.com/feyeleanor/RubyGoLightly

   1<html>
   2<head>
   3<title>pcrepattern specification</title>
   4</head>
   5<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
   6<h1>pcrepattern man page</h1>
   7<p>
   8Return to the <a href="index.html">PCRE index page</a>.
   9</p>
  10<p>
  11This page is part of the PCRE HTML documentation. It was generated automatically
  12from the original man page. If there is any nonsense in it, please consult the
  13man page, in case the conversion went wrong.
  14<br>
  15<ul>
  16<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a>
  17<li><a name="TOC2" href="#SEC2">NEWLINE CONVENTIONS</a>
  18<li><a name="TOC3" href="#SEC3">CHARACTERS AND METACHARACTERS</a>
  19<li><a name="TOC4" href="#SEC4">BACKSLASH</a>
  20<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a>
  21<li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT)</a>
  22<li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a>
  23<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a>
  24<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a>
  25<li><a name="TOC10" href="#SEC10">VERTICAL BAR</a>
  26<li><a name="TOC11" href="#SEC11">INTERNAL OPTION SETTING</a>
  27<li><a name="TOC12" href="#SEC12">SUBPATTERNS</a>
  28<li><a name="TOC13" href="#SEC13">DUPLICATE SUBPATTERN NUMBERS</a>
  29<li><a name="TOC14" href="#SEC14">NAMED SUBPATTERNS</a>
  30<li><a name="TOC15" href="#SEC15">REPETITION</a>
  31<li><a name="TOC16" href="#SEC16">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
  32<li><a name="TOC17" href="#SEC17">BACK REFERENCES</a>
  33<li><a name="TOC18" href="#SEC18">ASSERTIONS</a>
  34<li><a name="TOC19" href="#SEC19">CONDITIONAL SUBPATTERNS</a>
  35<li><a name="TOC20" href="#SEC20">COMMENTS</a>
  36<li><a name="TOC21" href="#SEC21">RECURSIVE PATTERNS</a>
  37<li><a name="TOC22" href="#SEC22">SUBPATTERNS AS SUBROUTINES</a>
  38<li><a name="TOC23" href="#SEC23">ONIGURUMA SUBROUTINE SYNTAX</a>
  39<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
  40<li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a>
  41<li><a name="TOC26" href="#SEC26">SEE ALSO</a>
  42<li><a name="TOC27" href="#SEC27">AUTHOR</a>
  43<li><a name="TOC28" href="#SEC28">REVISION</a>
  44</ul>
  45<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
  46<P>
  47The syntax and semantics of the regular expressions that are supported by PCRE
  48are described in detail below. There is a quick-reference syntax summary in the
  49<a href="pcresyntax.html"><b>pcresyntax</b></a>
  50page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
  51also supports some alternative regular expression syntax (which does not
  52conflict with the Perl syntax) in order to provide some compatibility with
  53regular expressions in Python, .NET, and Oniguruma.
  54</P>
  55<P>
  56Perl's regular expressions are described in its own documentation, and
  57regular expressions in general are covered in a number of books, some of which
  58have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
  59published by O'Reilly, covers regular expressions in great detail. This
  60description of PCRE's regular expressions is intended as reference material.
  61</P>
  62<P>
  63The original operation of PCRE was on strings of one-byte characters. However,
  64there is now also support for UTF-8 character strings. To use this, you must
  65build PCRE to include UTF-8 support, and then call <b>pcre_compile()</b> with
  66the PCRE_UTF8 option. How this affects pattern matching is mentioned in several
  67places below. There is also a summary of UTF-8 features in the
  68<a href="pcre.html#utf8support">section on UTF-8 support</a>
  69in the main
  70<a href="pcre.html"><b>pcre</b></a>
  71page.
  72</P>
  73<P>
  74The remainder of this document discusses the patterns that are supported by
  75PCRE when its main matching function, <b>pcre_exec()</b>, is used.
  76From release 6.0, PCRE offers a second matching function,
  77<b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not
  78Perl-compatible. Some of the features discussed below are not available when
  79<b>pcre_dfa_exec()</b> is used. The advantages and disadvantages of the
  80alternative function, and how it differs from the normal function, are
  81discussed in the
  82<a href="pcrematching.html"><b>pcrematching</b></a>
  83page.
  84</P>
  85<br><a name="SEC2" href="#TOC1">NEWLINE CONVENTIONS</a><br>
  86<P>
  87PCRE supports five different conventions for indicating line breaks in
  88strings: a single CR (carriage return) character, a single LF (linefeed)
  89character, the two-character sequence CRLF, any of the three preceding, or any
  90Unicode newline sequence. The
  91<a href="pcreapi.html"><b>pcreapi</b></a>
  92page has
  93<a href="pcreapi.html#newlines">further discussion</a>
  94about newlines, and shows how to set the newline convention in the
  95<i>options</i> arguments for the compiling and matching functions.
  96</P>
  97<P>
  98It is also possible to specify a newline convention by starting a pattern
  99string with one of the following five sequences:
 100<pre>
 101  (*CR)        carriage return
 102  (*LF)        linefeed
 103  (*CRLF)      carriage return, followed by linefeed
 104  (*ANYCRLF)   any of the three above
 105  (*ANY)       all Unicode newline sequences
 106</pre>
 107These override the default and the options given to <b>pcre_compile()</b>. For
 108example, on a Unix system where LF is the default newline sequence, the pattern
 109<pre>
 110  (*CR)a.b
 111</pre>
 112changes the convention to CR. That pattern matches "a\nb" because LF is no
 113longer a newline. Note that these special settings, which are not
 114Perl-compatible, are recognized only at the very start of a pattern, and that
 115they must be in upper case. If more than one of them is present, the last one
 116is used.
 117</P>
 118<P>
 119The newline convention does not affect what the \R escape sequence matches. By
 120default, this is any Unicode newline sequence, for Perl compatibility. However,
 121this can be changed; see the description of \R in the section entitled
 122<a href="#newlineseq">"Newline sequences"</a>
 123below. A change of \R setting can be combined with a change of newline
 124convention.
 125</P>
 126<br><a name="SEC3" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
 127<P>
 128A regular expression is a pattern that is matched against a subject string from
 129left to right. Most characters stand for themselves in a pattern, and match the
 130corresponding characters in the subject. As a trivial example, the pattern
 131<pre>
 132  The quick brown fox
 133</pre>
 134matches a portion of a subject string that is identical to itself. When
 135caseless matching is specified (the PCRE_CASELESS option), letters are matched
 136independently of case. In UTF-8 mode, PCRE always understands the concept of
 137case for characters whose values are less than 128, so caseless matching is
 138always possible. For characters with higher values, the concept of case is
 139supported if PCRE is compiled with Unicode property support, but not otherwise.
 140If you want to use caseless matching for characters 128 and above, you must
 141ensure that PCRE is compiled with Unicode property support as well as with
 142UTF-8 support.
 143</P>
 144<P>
 145The power of regular expressions comes from the ability to include alternatives
 146and repetitions in the pattern. These are encoded in the pattern by the use of
 147<i>metacharacters</i>, which do not stand for themselves but instead are
 148interpreted in some special way.
 149</P>
 150<P>
 151There are two different sets of metacharacters: those that are recognized
 152anywhere in the pattern except within square brackets, and those that are
 153recognized within square brackets. Outside square brackets, the metacharacters
 154are as follows:
 155<pre>
 156  \      general escape character with several uses
 157  ^      assert start of string (or line, in multiline mode)
 158  $      assert end of string (or line, in multiline mode)
 159  .      match any character except newline (by default)
 160  [      start character class definition
 161  |      start of alternative branch
 162  (      start subpattern
 163  )      end subpattern
 164  ?      extends the meaning of (
 165         also 0 or 1 quantifier
 166         also quantifier minimizer
 167  *      0 or more quantifier
 168  +      1 or more quantifier
 169         also "possessive quantifier"
 170  {      start min/max quantifier
 171</pre>
 172Part of a pattern that is in square brackets is called a "character class". In
 173a character class the only metacharacters are:
 174<pre>
 175  \      general escape character
 176  ^      negate the class, but only if the first character
 177  -      indicates character range
 178  [      POSIX character class (only if followed by POSIX syntax)
 179  ]      terminates the character class
 180</pre>
 181The following sections describe the use of each of the metacharacters.
 182</P>
 183<br><a name="SEC4" href="#TOC1">BACKSLASH</a><br>
 184<P>
 185The backslash character has several uses. Firstly, if it is followed by a
 186non-alphanumeric character, it takes away any special meaning that character
 187may have. This use of backslash as an escape character applies both inside and
 188outside character classes.
 189</P>
 190<P>
 191For example, if you want to match a * character, you write \* in the pattern.
 192This escaping action applies whether or not the following character would
 193otherwise be interpreted as a metacharacter, so it is always safe to precede a
 194non-alphanumeric with backslash to specify that it stands for itself. In
 195particular, if you want to match a backslash, you write \\.
 196</P>
 197<P>
 198If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
 199pattern (other than in a character class) and characters between a # outside
 200a character class and the next newline are ignored. An escaping backslash can
 201be used to include a whitespace or # character as part of the pattern.
 202</P>
 203<P>
 204If you want to remove the special meaning from a sequence of characters, you
 205can do so by putting them between \Q and \E. This is different from Perl in
 206that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in
 207Perl, $ and @ cause variable interpolation. Note the following examples:
 208<pre>
 209  Pattern            PCRE matches   Perl matches
 210
 211  \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
 212  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
 213  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
 214</pre>
 215The \Q...\E sequence is recognized both inside and outside character classes.
 216<a name="digitsafterbackslash"></a></P>
 217<br><b>
 218Non-printing characters
 219</b><br>
 220<P>
 221A second use of backslash provides a way of encoding non-printing characters
 222in patterns in a visible manner. There is no restriction on the appearance of
 223non-printing characters, apart from the binary zero that terminates a pattern,
 224but when a pattern is being prepared by text editing, it is usually easier to
 225use one of the following escape sequences than the binary character it
 226represents:
 227<pre>
 228  \a        alarm, that is, the BEL character (hex 07)
 229  \cx       "control-x", where x is any character
 230  \e        escape (hex 1B)
 231  \f        formfeed (hex 0C)
 232  \n        linefeed (hex 0A)
 233  \r        carriage return (hex 0D)
 234  \t        tab (hex 09)
 235  \ddd      character with octal code ddd, or backreference
 236  \xhh      character with hex code hh
 237  \x{hhh..} character with hex code hhh..
 238</pre>
 239The precise effect of \cx is as follows: if x is a lower case letter, it
 240is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
 241Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; becomes hex
 2427B.
 243</P>
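<P>
The arithmetic behind \cx can be sketched in a few lines of Python. This is
only a model of the rule stated above; the function name control_char is
hypothetical and is not part of PCRE.
</P>

```python
def control_char(x):
    # Model of the \cx rule: a lower case letter is first converted to
    # upper case, then bit 6 (hex 40) of the character code is inverted.
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())
    return code ^ 0x40


print(hex(control_char("z")))   # 'z' -> 'Z' (hex 5A) -> hex 1A
print(hex(control_char("{")))   # hex 7B -> hex 3B
print(hex(control_char(";")))   # hex 3B -> hex 7B
```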
 244<P>
 245After \x, from zero to two hexadecimal digits are read (letters can be in
 246upper or lower case). Any number of hexadecimal digits may appear between \x{
 247and }, but the value of the character code must be less than 256 in non-UTF-8
 248mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in
 249hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code
 250point, which is 10FFFF.
 251</P>
 252<P>
 253If characters other than hexadecimal digits appear between \x{ and }, or if
 254there is no terminating }, this form of escape is not recognized. Instead, the
 255initial \x will be interpreted as a basic hexadecimal escape, with no
 256following digits, giving a character whose value is zero.
 257</P>
 258<P>
 259Characters whose value is less than 256 can be defined by either of the two
 260syntaxes for \x. There is no difference in the way they are handled. For
 261example, \xdc is exactly the same as \x{dc}.
 262</P>
 263<P>
 264After \0 up to two further octal digits are read. If there are fewer than two
 265digits, just those that are present are used. Thus the sequence \0\x\07
 266specifies two binary zeros followed by a BEL character (code value 7). Make
 267sure you supply two digits after the initial zero if the pattern character that
 268follows is itself an octal digit.
 269</P>
 270<P>
 271The handling of a backslash followed by a digit other than 0 is complicated.
 272Outside a character class, PCRE reads it and any following digits as a decimal
 273number. If the number is less than 10, or if there have been at least that many
 274previous capturing left parentheses in the expression, the entire sequence is
 275taken as a <i>back reference</i>. A description of how this works is given
 276<a href="#backreferences">later,</a>
 277following the discussion of
 278<a href="#subpattern">parenthesized subpatterns.</a>
 279</P>
 280<P>
 281Inside a character class, or if the decimal number is greater than 9 and there
 282have not been that many capturing subpatterns, PCRE re-reads up to three octal
 283digits following the backslash, and uses them to generate a data character. Any
 284subsequent digits stand for themselves. In non-UTF-8 mode, the value of a
 285character specified in octal must be less than \400. In UTF-8 mode, values up
 286to \777 are permitted. For example:
 287<pre>
 288  \040   is another way of writing a space
 289  \40    is the same, provided there are fewer than 40 previous capturing subpatterns
 290  \7     is always a back reference
 291  \11    might be a back reference, or another way of writing a tab
 292  \011   is always a tab
 293  \0113  is a tab followed by the character "3"
 294  \113   might be a back reference, otherwise the character with octal code 113
 295  \377   might be a back reference, otherwise the byte consisting entirely of 1 bits
 296  \81    is either a back reference, or a binary zero followed by the two characters "8" and "1"
 297</pre>
 298Note that octal values of 100 or greater must not be introduced by a leading
 299zero, because no more than three octal digits are ever read.
 300</P>
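<P>
The decision rule for a backslash followed by a non-zero digit, outside a
character class, can be modelled as follows. This is a simplified sketch, not
PCRE's actual code: the function name interpret_backslash_digits and its return
format are hypothetical, and the limits on octal values (\400 and \777) are
not checked.
</P>

```python
def interpret_backslash_digits(digits, groups_so_far):
    # digits: the run of decimal digits after the backslash (first digit not 0).
    # groups_so_far: number of capturing left parentheses seen so far.
    number = int(digits)
    if number < 10 or number <= groups_so_far:
        # The entire sequence is taken as a back reference.
        return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    value = int(octal, 8) if octal else 0
    # Any digits that were not consumed stand for themselves.
    return ("char", value, digits[len(octal):])


print(interpret_backslash_digits("7", 0))     # always a back reference
print(interpret_backslash_digits("11", 0))    # a tab (octal 11)
print(interpret_backslash_digits("11", 11))   # a back reference
print(interpret_backslash_digits("81", 0))    # binary zero, then "8" and "1"
```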
 301<P>
 302All the sequences that define a single character value can be used both inside
 303and outside character classes. In addition, inside a character class, the
 304sequence \b is interpreted as the backspace character (hex 08), and the
 305sequences \R and \X are interpreted as the characters "R" and "X",
 306respectively. Outside a character class, these sequences have different
 307meanings
 308<a href="#uniextseq">(see below).</a>
 309</P>
 310<br><b>
 311Absolute and relative back references
 312</b><br>
 313<P>
 314The sequence \g followed by an unsigned or a negative number, optionally
 315enclosed in braces, is an absolute or relative back reference. A named back
 316reference can be coded as \g{name}. Back references are discussed
 317<a href="#backreferences">later,</a>
 318following the discussion of
 319<a href="#subpattern">parenthesized subpatterns.</a>
 320</P>
 321<br><b>
 322Absolute and relative subroutine calls
 323</b><br>
 324<P>
 325For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
 326a number enclosed either in angle brackets or single quotes, is an alternative
 327syntax for referencing a subpattern as a "subroutine". Details are discussed
 328<a href="#onigurumasubroutines">later.</a>
 329Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
 330synonymous. The former is a back reference; the latter is a subroutine call.
 331</P>
 332<br><b>
 333Generic character types
 334</b><br>
 335<P>
 336Another use of backslash is for specifying generic character types. The
 337following are always recognized:
 338<pre>
 339  \d     any decimal digit
 340  \D     any character that is not a decimal digit
 341  \h     any horizontal whitespace character
 342  \H     any character that is not a horizontal whitespace character
 343  \s     any whitespace character
 344  \S     any character that is not a whitespace character
 345  \v     any vertical whitespace character
 346  \V     any character that is not a vertical whitespace character
 347  \w     any "word" character
 348  \W     any "non-word" character
 349</pre>
 350Each pair of escape sequences partitions the complete set of characters into
 351two disjoint sets. Any given character matches one, and only one, of each pair.
 352</P>
 353<P>
 354These character type sequences can appear both inside and outside character
 355classes. They each match one character of the appropriate type. If the current
 356matching point is at the end of the subject string, all of them fail, since
 357there is no character to match.
 358</P>
 359<P>
 360For compatibility with Perl, \s does not match the VT character (code 11).
  361This makes it different from the POSIX "space" class. The \s characters
 362are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
 363included in a Perl script, \s may match the VT character. In PCRE, it never
 364does.
 365</P>
 366<P>
  367In UTF-8 mode, characters with values greater than 127 never match \d, \s, or
 368\w, and always match \D, \S, and \W. This is true even when Unicode
 369character property support is available. These sequences retain their original
 370meanings from before UTF-8 support was available, mainly for efficiency
 371reasons.
 372</P>
 373<P>
 374The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the
 375other sequences, these do match certain high-valued codepoints in UTF-8 mode.
 376The horizontal space characters are:
 377<pre>
 378  U+0009     Horizontal tab
 379  U+0020     Space
 380  U+00A0     Non-break space
 381  U+1680     Ogham space mark
 382  U+180E     Mongolian vowel separator
 383  U+2000     En quad
 384  U+2001     Em quad
 385  U+2002     En space
 386  U+2003     Em space
 387  U+2004     Three-per-em space
 388  U+2005     Four-per-em space
 389  U+2006     Six-per-em space
 390  U+2007     Figure space
 391  U+2008     Punctuation space
 392  U+2009     Thin space
 393  U+200A     Hair space
 394  U+202F     Narrow no-break space
 395  U+205F     Medium mathematical space
 396  U+3000     Ideographic space
 397</pre>
 398The vertical space characters are:
 399<pre>
 400  U+000A     Linefeed
 401  U+000B     Vertical tab
 402  U+000C     Formfeed
 403  U+000D     Carriage return
 404  U+0085     Next line
 405  U+2028     Line separator
 406  U+2029     Paragraph separator
 407</PRE>
 408</P>
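<P>
The two lists above can be written down as plain sets of code points, which is
a convenient way to check a character by hand. The names HORIZONTAL, VERTICAL,
is_horizontal_space and is_vertical_space are illustrative only and do not
correspond to anything in the PCRE API.
</P>

```python
# Code points matched by \h, copied from the list above.
HORIZONTAL = frozenset([
    0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200B),          # U+2000 (en quad) to U+200A (hair space)
    0x202F, 0x205F, 0x3000,
])

# Code points matched by \v, copied from the list above.
VERTICAL = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])


def is_horizontal_space(ch):
    return ord(ch) in HORIZONTAL


def is_vertical_space(ch):
    return ord(ch) in VERTICAL


print(is_horizontal_space("\u00a0"), is_vertical_space("\u00a0"))
print(is_horizontal_space("\x0b"), is_vertical_space("\x0b"))
```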
 409<P>
 410A "word" character is an underscore or any character less than 256 that is a
 411letter or digit. The definition of letters and digits is controlled by PCRE's
 412low-valued character tables, and may vary if locale-specific matching is taking
 413place (see
 414<a href="pcreapi.html#localesupport">"Locale support"</a>
 415in the
 416<a href="pcreapi.html"><b>pcreapi</b></a>
 417page). For example, in a French locale such as "fr_FR" in Unix-like systems,
 418or "french" in Windows, some character codes greater than 128 are used for
 419accented letters, and these are matched by \w. The use of locales with Unicode
 420is discouraged.
 421<a name="newlineseq"></a></P>
 422<br><b>
 423Newline sequences
 424</b><br>
 425<P>
 426Outside a character class, by default, the escape sequence \R matches any
 427Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is
 428equivalent to the following:
 429<pre>
 430  (?&#62;\r\n|\n|\x0b|\f|\r|\x85)
 431</pre>
 432This is an example of an "atomic group", details of which are given
 433<a href="#atomicgroup">below.</a>
 434This particular group matches either the two-character sequence CR followed by
 435LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
 436U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next
 437line, U+0085). The two-character sequence is treated as a single unit that
 438cannot be split.
 439</P>
 440<P>
 441In UTF-8 mode, two additional characters whose codepoints are greater than 255
 442are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
 443Unicode character property support is not needed for these characters to be
 444recognized.
 445</P>
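<P>
As a rough illustration, the sets of newline sequences described above can be
spelled out as explicit alternations. The sketch below uses Python's re module
as a stand-in for PCRE, which is an assumption made for illustration; it uses a
plain alternation rather than an atomic group, which gives the same result when
splitting because the CRLF alternative is tried first. The names BSR_8BIT,
BSR_UTF8 and split_lines are hypothetical.
</P>

```python
import re

# The non-UTF-8 set from the text. CRLF comes first so that the pair is
# consumed as a single unit rather than as CR followed by LF.
BSR_8BIT = r"\r\n|\n|\x0b|\f|\r|\x85"

# UTF-8 mode adds LS (U+2028) and PS (U+2029).
BSR_UTF8 = BSR_8BIT + r"|\u2028|\u2029"


def split_lines(text, utf8=True):
    # Split text at every newline sequence in the chosen set.
    return re.split(BSR_UTF8 if utf8 else BSR_8BIT, text)


print(split_lines("a\r\nb\nc\x85d\u2028e"))
print(split_lines("a\u2028b", utf8=False))
```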
 446<P>
 447It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
 448complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF
  449either at compile time or when the pattern is matched. (BSR is an abbreviation
 450for "backslash R".) This can be made the default when PCRE is built; if this is
 451the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option.
 452It is also possible to specify these settings by starting a pattern string with
 453one of the following sequences:
 454<pre>
 455  (*BSR_ANYCRLF)   CR, LF, or CRLF only
 456  (*BSR_UNICODE)   any Unicode newline sequence
 457</pre>
 458These override the default and the options given to <b>pcre_compile()</b>, but
 459they can be overridden by options given to <b>pcre_exec()</b>. Note that these
 460special settings, which are not Perl-compatible, are recognized only at the
 461very start of a pattern, and that they must be in upper case. If more than one
 462of them is present, the last one is used. They can be combined with a change of
 463newline convention, for example, a pattern can start with:
 464<pre>
 465  (*ANY)(*BSR_ANYCRLF)
 466</pre>
 467Inside a character class, \R matches the letter "R".
 468<a name="uniextseq"></a></P>
 469<br><b>
 470Unicode character properties
 471</b><br>
 472<P>
 473When PCRE is built with Unicode character property support, three additional
 474escape sequences that match characters with specific properties are available.
 475When not in UTF-8 mode, these sequences are of course limited to testing
 476characters whose codepoints are less than 256, but they do work in this mode.
 477The extra escape sequences are:
 478<pre>
 479  \p{<i>xx</i>}   a character with the <i>xx</i> property
 480  \P{<i>xx</i>}   a character without the <i>xx</i> property
 481  \X       an extended Unicode sequence
 482</pre>
 483The property names represented by <i>xx</i> above are limited to the Unicode
 484script names, the general category properties, and "Any", which matches any
 485character (including newline). Other properties such as "InMusicalSymbols" are
 486not currently supported by PCRE. Note that \P{Any} does not match any
 487characters, so always causes a match failure.
 488</P>
 489<P>
 490Sets of Unicode characters are defined as belonging to certain scripts. A
 491character from one of these sets can be matched using a script name. For
 492example:
 493<pre>
 494  \p{Greek}
 495  \P{Han}
 496</pre>
 497Those that are not part of an identified script are lumped together as
 498"Common". The current list of scripts is:
 499</P>
 500<P>
 501Arabic,
 502Armenian,
 503Balinese,
 504Bengali,
 505Bopomofo,
 506Braille,
 507Buginese,
 508Buhid,
 509Canadian_Aboriginal,
 510Cherokee,
 511Common,
 512Coptic,
 513Cuneiform,
 514Cypriot,
 515Cyrillic,
 516Deseret,
 517Devanagari,
 518Ethiopic,
 519Georgian,
 520Glagolitic,
 521Gothic,
 522Greek,
 523Gujarati,
 524Gurmukhi,
 525Han,
 526Hangul,
 527Hanunoo,
 528Hebrew,
 529Hiragana,
 530Inherited,
 531Kannada,
 532Katakana,
 533Kharoshthi,
 534Khmer,
 535Lao,
 536Latin,
 537Limbu,
 538Linear_B,
 539Malayalam,
 540Mongolian,
 541Myanmar,
 542New_Tai_Lue,
 543Nko,
 544Ogham,
 545Old_Italic,
 546Old_Persian,
 547Oriya,
 548Osmanya,
 549Phags_Pa,
 550Phoenician,
 551Runic,
 552Shavian,
 553Sinhala,
 554Syloti_Nagri,
 555Syriac,
 556Tagalog,
 557Tagbanwa,
 558Tai_Le,
 559Tamil,
 560Telugu,
 561Thaana,
 562Thai,
 563Tibetan,
 564Tifinagh,
 565Ugaritic,
 566Yi.
 567</P>
 568<P>
 569Each character has exactly one general category property, specified by a
 570two-letter abbreviation. For compatibility with Perl, negation can be specified
 571by including a circumflex between the opening brace and the property name. For
 572example, \p{^Lu} is the same as \P{Lu}.
 573</P>
 574<P>
 575If only one letter is specified with \p or \P, it includes all the general
 576category properties that start with that letter. In this case, in the absence
 577of negation, the curly brackets in the escape sequence are optional; these two
 578examples have the same effect:
 579<pre>
 580  \p{L}
 581  \pL
 582</pre>
 583The following general category property codes are supported:
 584<pre>
 585  C     Other
 586  Cc    Control
 587  Cf    Format
 588  Cn    Unassigned
 589  Co    Private use
 590  Cs    Surrogate
 591
 592  L     Letter
 593  Ll    Lower case letter
 594  Lm    Modifier letter
 595  Lo    Other letter
 596  Lt    Title case letter
 597  Lu    Upper case letter
 598
 599  M     Mark
 600  Mc    Spacing mark
 601  Me    Enclosing mark
 602  Mn    Non-spacing mark
 603
 604  N     Number
 605  Nd    Decimal number
 606  Nl    Letter number
 607  No    Other number
 608
 609  P     Punctuation
 610  Pc    Connector punctuation
 611  Pd    Dash punctuation
 612  Pe    Close punctuation
 613  Pf    Final punctuation
 614  Pi    Initial punctuation
 615  Po    Other punctuation
 616  Ps    Open punctuation
 617
 618  S     Symbol
 619  Sc    Currency symbol
 620  Sk    Modifier symbol
 621  Sm    Mathematical symbol
 622  So    Other symbol
 623
 624  Z     Separator
 625  Zl    Line separator
 626  Zp    Paragraph separator
 627  Zs    Space separator
 628</pre>
 629The special property L& is also supported: it matches a character that has
 630the Lu, Ll, or Lt property, in other words, a letter that is not classified as
 631a modifier or "other".
 632</P>
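<P>
The way one-letter codes, two-letter codes, L&amp; and circumflex negation
relate to each other can be sketched with Python's unicodedata module, which
reports the same two-letter general category codes. This is an illustration
only: the function has_property is hypothetical, script names are not handled,
and the Unicode tables shipped with Python may differ in version from those
built into PCRE.
</P>

```python
import unicodedata


def has_property(ch, prop):
    # A leading circumflex negates the property, as in \p{^Lu}.
    negate = prop.startswith("^")
    if negate:
        prop = prop[1:]
    category = unicodedata.category(ch)      # two-letter code such as "Lu"
    if prop == "L&":
        # L& covers cased letters only: Lu, Ll, or Lt.
        result = category in ("Lu", "Ll", "Lt")
    else:
        # A one-letter code matches every category that starts with it;
        # a two-letter code matches only that exact category.
        result = category.startswith(prop)
    return result != negate


print(has_property("A", "Lu"), has_property("A", "L"), has_property("A", "^Lu"))
print(has_property("\u02b0", "Lm"), has_property("\u02b0", "L&"))
```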
 633<P>
 634The Cs (Surrogate) property applies only to characters in the range U+D800 to
 635U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so
 636cannot be tested by PCRE, unless UTF-8 validity checking has been turned off
 637(see the discussion of PCRE_NO_UTF8_CHECK in the
 638<a href="pcreapi.html"><b>pcreapi</b></a>
 639page).
 640</P>
 641<P>
 642The long synonyms for these properties that Perl supports (such as \p{Letter})
 643are not supported by PCRE, nor is it permitted to prefix any of these
 644properties with "Is".
 645</P>
 646<P>
 647No character that is in the Unicode table has the Cn (unassigned) property.
 648Instead, this property is assumed for any code point that is not in the
 649Unicode table.
 650</P>
 651<P>
 652Specifying caseless matching does not affect these escape sequences. For
 653example, \p{Lu} always matches only upper case letters.
 654</P>
 655<P>
 656The \X escape matches any number of Unicode characters that form an extended
 657Unicode sequence. \X is equivalent to
 658<pre>
 659  (?&#62;\PM\pM*)
 660</pre>
 661That is, it matches a character without the "mark" property, followed by zero
 662or more characters with the "mark" property, and treats the sequence as an
 663atomic group
 664<a href="#atomicgroup">(see below).</a>
 665Characters with the "mark" property are typically accents that affect the
 666preceding character. None of them have codepoints less than 256, so in
 667non-UTF-8 mode \X matches any one character.
 668</P>
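<P>
The definition of \X given above (one character without the "mark" property
followed by any number of characters with it) can be modelled directly. The
sketch below uses Python's unicodedata module to test for the mark categories;
the function names is_mark and extended_sequences are hypothetical, and the
result depends on the Unicode tables shipped with Python rather than those in
PCRE.
</P>

```python
import unicodedata


def is_mark(ch):
    # The "mark" property is the general category M (Mc, Me, or Mn).
    return unicodedata.category(ch).startswith("M")


def extended_sequences(text):
    # Repeatedly apply the equivalent of \PM\pM* from left to right.
    sequences = []
    i = 0
    while i < len(text):
        if is_mark(text[i]):
            # \PM cannot match a mark, so no sequence starts here.
            i += 1
            continue
        j = i + 1
        while j < len(text) and is_mark(text[j]):
            j += 1
        sequences.append(text[i:j])
        i = j
    return sequences


# "e" followed by U+0301 (combining acute accent) stays together.
print([len(s) for s in extended_sequences("e\u0301a")])
```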
 669<P>
 670Matching characters by Unicode property is not fast, because PCRE has to search
 671a structure that contains data for over fifteen thousand characters. That is
 672why the traditional escape sequences such as \d and \w do not use Unicode
 673properties in PCRE.
 674<a name="resetmatchstart"></a></P>
 675<br><b>
 676Resetting the match start
 677</b><br>
 678<P>
 679The escape sequence \K, which is a Perl 5.10 feature, causes any previously
 680matched characters not to be included in the final matched sequence. For
 681example, the pattern:
 682<pre>
 683  foo\Kbar
 684</pre>
 685matches "foobar", but reports that it has matched "bar". This feature is
 686similar to a lookbehind assertion
 687<a href="#lookbehind">(described below).</a>
 688However, in this case, the part of the subject before the real match does not
 689have to be of fixed length, as lookbehind assertions do. The use of \K does
 690not interfere with the setting of
 691<a href="#subpattern">captured substrings.</a>
 692For example, when the pattern
 693<pre>
 694  (foo)\Kbar
 695</pre>
 696matches "foobar", the first substring is still set to "foo".
 697<a name="smallassertions"></a></P>
 698<br><b>
 699Simple assertions
 700</b><br>
 701<P>
 702The final use of backslash is for certain simple assertions. An assertion
 703specifies a condition that has to be met at a particular point in a match,
 704without consuming any characters from the subject string. The use of
 705subpatterns for more complicated assertions is described
 706<a href="#bigassertions">below.</a>
 707The backslashed assertions are:
 708<pre>
 709  \b     matches at a word boundary
 710  \B     matches when not at a word boundary
 711  \A     matches at the start of the subject
 712  \Z     matches at the end of the subject
 713          also matches before a newline at the end of the subject
 714  \z     matches only at the end of the subject
 715  \G     matches at the first matching position in the subject
 716</pre>
 717These assertions may not appear in character classes (but note that \b has a
 718different meaning, namely the backspace character, inside a character class).
 719</P>
 720<P>
 721A word boundary is a position in the subject string where the current character
 722and the previous character do not both match \w or \W (i.e. one matches
 723\w and the other matches \W), or the start or end of the string if the
 724first or last character matches \w, respectively.
 725</P>
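<P>
The definition of a word boundary can be checked position by position. The
following sketch assumes the default tables, where word characters are the
ASCII letters, digits, and underscore; locale-specific tables are not
modelled. The function names is_word_char and at_word_boundary are
hypothetical.
</P>

```python
def is_word_char(ch):
    # Assumption: default tables, so only ASCII letters, digits, and "_".
    return ch == "_" or (ord(ch) < 128 and ch.isalnum())


def at_word_boundary(subject, pos):
    # Treat the positions before the start and after the end as non-word.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    # A boundary exists when exactly one side is a word character.
    return before != after


subject = "ab cd"
print([p for p in range(len(subject) + 1) if at_word_boundary(subject, p)])
```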
<P>
The \A, \Z, and \z assertions differ from the traditional circumflex and
dollar (described in the next section) in that they only ever match at the very
start and end of the subject string, whatever options are set. Thus, they are
independent of multiline mode. These three assertions are not affected by the
PCRE_NOTBOL or PCRE_NOTEOL options, which affect only the behaviour of the
circumflex and dollar metacharacters. However, if the <i>startoffset</i>
argument of <b>pcre_exec()</b> is non-zero, indicating that matching is to start
at a point other than the beginning of the subject, \A can never match. The
difference between \Z and \z is that \Z matches before a newline at the end
of the string as well as at the very end, whereas \z matches only at the end.
</P>
<P>
The \G assertion is true only when the current matching position is at the
start point of the match, as specified by the <i>startoffset</i> argument of
<b>pcre_exec()</b>. It differs from \A when the value of <i>startoffset</i> is
non-zero. By calling <b>pcre_exec()</b> multiple times with appropriate
arguments, you can mimic Perl's /g option, and \G is useful in this kind of
implementation.
</P>
<P>
Note, however, that PCRE's interpretation of \G, as the start of the current
match, is subtly different from Perl's, which defines it as the end of the
previous match. In Perl, these can be different when the previously matched
string was empty. Because PCRE does just one match at a time, it cannot
reproduce this behaviour.
</P>
<P>
If all the alternatives of a pattern begin with \G, the expression is anchored
to the starting match position, and the "anchored" flag is set in the compiled
regular expression.
</P>
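<P>
The repeated-call idea behind \G can be sketched as a small tokenizing loop.
This is only an analogy: Python's <b>re</b> module has no \G, so the sketch
relies on the compiled pattern's match() method, which is anchored at the
position passed in and so stands in for a pattern beginning with \G run at a
given <i>startoffset</i>. The TOKEN pattern and the tokens() helper are
hypothetical names chosen for this example.
</P>

```python
import re

TOKEN = re.compile(r"\d+|[a-z]+")  # hypothetical token pattern


def tokens(subject):
    """Collect tokens that follow one another without gaps."""
    found, pos = [], 0
    while pos < len(subject):
        # Anchored at pos, much as \G anchors at startoffset.
        m = TOKEN.match(subject, pos)
        if m is None:
            break  # nothing starts exactly at pos, so stop scanning
        found.append(m.group())
        pos = m.end()  # the next call starts where this match ended
    return found, pos
```

<P>
The loop stops at the first position where no token begins, instead of skipping
ahead to a later match as an unanchored search would.
</P>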
<br><a name="SEC5" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
<P>
Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching point is
at the start of the subject string. If the <i>startoffset</i> argument of
<b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE
option is unset. Inside a character class, circumflex has an entirely different
meaning
<a href="#characterclass">(see below).</a>
</P>
<P>
Circumflex need not be the first character of the pattern if a number of
alternatives are involved, but it should be the first thing in each alternative
in which it appears if the pattern is ever to match that branch. If all
possible alternatives start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is said to be an
"anchored" pattern. (There are also other constructs that can cause a pattern
to be anchored.)
</P>
<P>
A dollar character is an assertion that is true only if the current matching
point is at the end of the subject string, or immediately before a newline
at the end of the string (by default). Dollar need not be the last character of
the pattern if a number of alternatives are involved, but it should be the last
item in any branch in which it appears. Dollar has no special meaning in a
character class.
</P>
<P>
The meaning of dollar can be changed so that it matches only at the very end of
the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This
does not affect the \Z assertion.
</P>
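<P>
The two meanings of "end of subject" can be compared in a short sketch. It uses
Python's <b>re</b> module as a stand-in, with one caveat: Python has no
PCRE_DOLLAR_ENDONLY option, and the assertion it spells \Z is true only at the
very end, which is what PCRE spells \z. The default dollar is assumed to behave
as described above.
</P>

```python
import re

subjects = ("abc", "abc\n", "abc\n\n")

# Default dollar: true at the very end, or just before one final newline.
default_dollar = [bool(re.search(r"abc$", s)) for s in subjects]

# Python's \Z (PCRE's \z): true only at the very end of the subject.
end_only = [bool(re.search(r"abc\Z", s)) for s in subjects]
```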
<P>
The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, a circumflex matches
immediately after internal newlines as well as at the start of the subject
string. It does not match after a newline that ends the string. A dollar
matches before any newlines in the string, as well as at the very end, when
PCRE_MULTILINE is set. When newline is specified as the two-character
sequence CRLF, isolated CR and LF characters do not indicate newlines.
</P>
<P>
For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
\n represents a newline) in multiline mode, but not otherwise. Consequently,
patterns that are anchored in single line mode because all branches start with
^ are not anchored in multiline mode, and a match for circumflex is possible
when the <i>startoffset</i> argument of <b>pcre_exec()</b> is non-zero. The
PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
</P>
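<P>
The /^abc$/ example can be reproduced in a short sketch. Python's <b>re</b>
module is used in place of PCRE; its MULTILINE flag is assumed to correspond to
PCRE_MULTILINE for this subject, and the position argument of search() plays
the role of <i>startoffset</i>.
</P>

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# Starting the search at offset 4, just after the internal newline.
plain = re.compile(r"^abc")
multi = re.compile(r"^abc", re.MULTILINE)
from_offset_plain = plain.search(subject, 4)
from_offset_multi = multi.search(subject, 4)
```

<P>
Without multiline mode the circumflex cannot match at offset 4, because that is
not the true start of the subject; with it, the position after the newline
qualifies.
</P>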
<P>
Note that the sequences \A, \Z, and \z can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\A it is always anchored, whether or not PCRE_MULTILINE is set.
</P>
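<P>
A minimal sketch of the same point, again using Python's <b>re</b> module as a
stand-in for PCRE: with multiline mode on, the circumflex follows the internal
newline but \A does not.
</P>

```python
import re

subject = "def\nabc"

# ^ follows the multiline setting; \A never does.
caret_hit = re.search(r"^abc", subject, re.MULTILINE)
start_miss = re.search(r"\Aabc", subject, re.MULTILINE)
start_hit = re.search(r"\Adef", subject, re.MULTILINE)
```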
<br><a name="SEC6" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br>
<P>
Outside a character class, a dot in the pattern matches any one character in
the subject string except (by default) a character that signifies the end of a
line. In UTF-8 mode, the matched character may be more than one byte long.
</P>
<P>
When a line ending is defined as a single character, dot never matches that
character; when the two-character sequence CRLF is used, dot does not match CR
if it is immediately followed by LF, but otherwise it matches all characters
(including isolated CRs and LFs). When any Unicode line endings are being
recognized, dot does not match CR or LF or any of the other line ending
characters.
</P>
<P>
The behaviour of dot with regard to newlines can be changed. If the PCRE_DOTALL
option is set, a dot matches any one character, without exception. If the
two-character sequence CRLF is present in the subject string, it takes two dots
to match it.
</P>
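<P>
The effect of the dot-all setting can be sketched as follows. Python's
<b>re</b> module is used here as an approximation: its DOTALL flag is assumed
to correspond to PCRE_DOTALL, and unlike PCRE it always treats LF alone as the
line ending, so the configurable newline conventions are not illustrated.
</P>

```python
import re

# By default, dot refuses the newline character.
default_dot = re.fullmatch(r"a.c", "a\nc")

# With the dot-all option, dot matches any one character.
dotall_dot = re.fullmatch(r"a.c", "a\nc", re.DOTALL)

# CRLF is two characters, so it takes two dots to cross it.
one_dot_crlf = re.fullmatch(r"a.c", "a\r\nc", re.DOTALL)
two_dots_crlf = re.fullmatch(r"a..c", "a\r\nc", re.DOTALL)
```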
<P>
The handling of dot is entirely independent of the handling of circumflex and
dollar, the only relationship being that they both involve newlines. Dot has no
special meaning in a character class.
</P>
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
<P>
Outside a character class, the escape sequence \C matches any one byte, both
in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
characters. The feature is provided in Perl in order to match individual bytes
in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes,
what remains in the string may be a malformed UTF-8 string. For this reason,
the \C escape sequence is best avoided.
</P>
<P>
PCRE does not allow \C to appear in lookbehind assertions
<a href="#lookbehind">(described below),</a>
because in UTF-8 mode this would make it impossible to calculate the length of
the lookbehind.
<a name="characterclass"></a></P>
<br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
<P>
An opening square bracket introduces a character class, terminated by a closing
square bracket. A closing square bracket on its own is not special. If a
closing square bracket is required as a member of the class, it should be the
first data character in the class (after an initial circumflex, if present) or
escaped with a backslash.
</P>
<P>
A character class matches a single character in the subject. In UTF-8 mode, the
character may occupy more than one byte. A matched character must be in the set
of characters defined by the class, unless the first character in the class
definition is a circumflex, in which case the subject character must not be in
the set defined by the class. If a circumflex is actually required as a member
of the class, ensure it is not the first character, or escape it with a
backslash.
</P>
<P>
For example, the character class [aeiou] matches any lower case vowel, while
[^aeiou] matches any character that is not a lower case vowel. Note that a
circumflex is just a convenient notation for specifying the characters that
are in the class by enumerating those that are not. A class that starts with a
circumflex is not an assertion: it still consumes a character from the subject
string, and therefore it fails if the current pointer is at the end of the
string.
</P>
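<P>
These class rules can be checked with a short sketch. It uses Python's
<b>re</b> module instead of PCRE, on the assumption that simple classes are
handled identically; the sample strings are invented for the example.
</P>

```python
import re

vowels = re.findall(r"[aeiou]", "Education")   # lower case vowels only
others = re.findall(r"[^aeiou]", "Education")  # everything else, "E" included

# A negated class still has to consume a character, so it fails at the end.
at_end = re.search(r"x[^aeiou]", "x")
followed = re.search(r"x[^aeiou]", "xy")

# "]" as a class member: first in the class, or escaped with a backslash.
bracket_first = re.findall(r"[]ab]", "a]b[")
bracket_escaped = re.findall(r"[ab\]]", "a]b[")
```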
<P>
In UTF-8 mode, characters with values greater than 255 can be included in a
class as a literal string of bytes, or by using the \x{ escaping mechanism.
</P>
<P>
When caseless matching is set, any letters in a class represent both their
upper case and lower case versions, so for example, a caseless [aeiou] matches
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
caseful version would. In UTF-8 mode, PCRE always understands the concept of
case for characters whose values are less than 128, so caseless matching is
always possible. For characters with higher values, the concept of case is
supported if PCRE is compiled with Unicode property support, but not otherwise.
If you want to use caseless matching for characters 128 and above, you must
ensure that PCRE is compiled with Unicode property support as well as with
UTF-8 support.
</P>
<P>
Characters that might indicate line breaks are never treated in any special way
when matching character classes, whatever line-ending sequence is in use, and
whatever setting of the PCRE_DOTALL and PCRE_MULTILINE options is used. A class
such as [^a] always matches one of these characters.
</P>
<P>
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
indicating a range, typically as the first or last character in the class.
</P>
<P>
It is not possible to have the literal character "]" as the end character of a
range. A pattern such as [W-]46] is interpreted as a class of two characters
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
the end of range, so [W-\]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
</P>
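<P>
The hyphen and closing-bracket rules can be tried out in a short sketch. It
relies on Python's <b>re</b> module, which is assumed to parse these particular
classes in the same way as PCRE.
</P>

```python
import re

in_range = [c for c in "cdmn" if re.fullmatch(r"[d-m]", c)]

# A hyphen placed first or last, or escaped, is an ordinary class member.
literal_hyphen = [
    bool(re.fullmatch(p, "-")) for p in (r"[-az]", r"[az-]", r"[a\-z]")
]

# [W-]46] is the two-character class [W-] followed by the literal "46]".
two_char_class = [
    s for s in ("W46]", "-46]", "X46]") if re.fullmatch(r"[W-]46]", s)
]

# [W-\]46] is a single class: the range W..] plus "4" and "6".
range_to_bracket = [c for c in "WZ]46a" if re.fullmatch(r"[W-\]46]", c)]
```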
<P>
Ranges operate in the collating sequence of character values. They can also be
used for characters specified numerically, for example [\000-\037]. In UTF-8
mode, ranges can include characters whose values are greater than 255, for
example [\x{100}-\x{2ff}].
</P>
<P>
If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
[][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character
tables for a French locale are in use, [\xc8-\xcb] matches accented E
characters in both cases. In UTF-8 mode, PCRE supports the concept of case for
characters with values of 128 and above only when it is compiled with Unicode
property support.
</P>
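<P>
The caseless [W-c] example can be sketched as follows, using Python's
<b>re</b> module and its IGNORECASE flag as an assumed equivalent of caseless
matching in PCRE for ASCII input. The sample characters are chosen to fall on
both sides of the range.
</P>

```python
import re

sample = "AWXdx^"

# Caseful: only character values from "W" (87) to "c" (99) qualify.
caseful = [c for c in sample if re.fullmatch(r"[W-c]", c)]

# Caseless: letters in the range also match in their other case.
caseless = [c for c in sample if re.fullmatch(r"[W-c]", c, re.IGNORECASE)]
```

<P>
Caselessly, "A" matches because "a" is in the range, and "x" matches because
"X" is; "d" matches in neither mode.
</P>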
<P>
The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
in a character class, and add the characters that they match to the class. For
example, [\dABCDEF] matches any hexadecimal digit that uses upper case letters.
A circumflex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower case type.
For example, the class [^\W_] matches any letter or digit, but not underscore.
</P>
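<P>
Both classes can be tried out with a short sketch. It uses Python's <b>re</b>
module, where \d and \W are assumed to agree with PCRE for the ASCII
characters in these sample strings.
</P>

```python
import re

# \d adds the decimal digits to the listed letters; matching is caseful.
hex_upper = re.findall(r"[\dABCDEF]", "7fA9gZ")

# "Not a non-word character and not underscore" leaves letters and digits.
letters_digits = re.findall(r"[^\W_]", "a_b-1 ")
```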
<P>
The only metacharacters that are recognized in character classes are backslash,
hyphen (only where it can be interpreted as specifying a range), circumflex
(only at the start), opening square bracket (only when it can be interpreted as
introducing a POSIX class name - see the next section), and the terminating
closing square bracket. However, escaping other non-alphanumeric characters
does no harm.
</P>
<br><a name="SEC9" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
<P>
Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also supports
this notation. For example,
<pre>
  [01[:alpha:]%]
</pre>
matches "0", "1", any alphabetic character, or "%". The supported class names
are
<pre>
  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (not quite the same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits
</pre>
The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
space (32). Notice that this list includes the VT character (code 11). This
makes "space" different to \s, which does not include VT (for Perl
compatibility).
</P>
<P>
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
5.8. Another Perl extension is negation, which is indicated by a ^ character
after the colon. For example,
<pre>
  [12[:^digit:]]
</pre>
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
</P>
<P>
In UTF-8 mode, characters with values of 128 and above do not match any of
the POSIX character classes.
</P>
<br><a name="SEC10" href="#TOC1">VERTICAL BAR</a><br>
<P>
Vertical bar characters are used to separate alternative patterns. For example,
the pattern
<pre>
  gilbert|sullivan
</pre>
matches either "gilbert" or "sullivan". Any number of alternatives may appear,
and an empty alternative is permitted (matching the empty string). The matching
process tries each alternative in turn, from left to right, and the first one
that succeeds is used. If the alternatives are within a subpattern
<a href="#subpattern">(defined below),</a>
"succeeds" means matching the rest of the main pattern as well as the
alternative in the subpattern.
</P>
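<P>
The left-to-right rule can be seen in a short sketch. It uses Python's
<b>re</b> module, a backtracking matcher that is assumed to try alternatives in
the same order as PCRE; the short patterns are made up for the example.
</P>

```python
import re

either = [
    bool(re.fullmatch(r"gilbert|sullivan", s))
    for s in ("gilbert", "sullivan", "gilbertsullivan")
]

# Alternatives are tried left to right; the first one that succeeds is used.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern, so the
# matcher has to fall back to the second alternative here.
needs_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"x(y|)", "x").group(1)
```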
<br><a name="SEC11" href="#TOC1">INTERNAL OPTION SETTING</a><br>
<P>
The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from within
the pattern by a sequence of Perl option letters enclosed between "(?" and ")".
The option letters are
<pre>
  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED
</pre>
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the letter with a hyphen, and a combined
setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
permitted. If a letter appears both before and after the hyphen, the option is
unset.
</P>
<P>
The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
changed in the same way as the Perl-compatible options by using the characters
J, U and X respectively.
</P>
<P>
When an option change occurs at top level (that is, not inside subpattern
parentheses), the change applies to the remainder of the pattern that follows.
If the change is placed right at the start of a pattern, PCRE extracts it into
the global options (and it will therefore show up in data extracted by the
<b>pcre_fullinfo()</b> function).
</P>
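<P>
A leading option setting that becomes a global option can be illustrated by
analogy. The sketch below uses Python's <b>re</b> module, where the flags
attribute of a compiled pattern is taken as a rough counterpart of the options
reported by <b>pcre_fullinfo()</b>; it does not call PCRE itself.
</P>

```python
import re

pattern = re.compile(r"(?im)^abc$")

# The leading (?im) is reported among the compiled pattern's flags.
is_caseless = bool(pattern.flags & re.IGNORECASE)
is_multiline = bool(pattern.flags & re.MULTILINE)
is_dotall = bool(pattern.flags & re.DOTALL)

# Both options are in force for the whole pattern.
hit = pattern.search("def\nABC")
```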
<P>
An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the current pattern that follows it, so
<pre>
  (a(?i)b)c
</pre>
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used).
By this means, options can be made to have different settings in different
parts of the pattern. Any changes made in one alternative do carry on
into subsequent branches within the same subpattern. For example,
<pre>
  (a(?i)b|c)
</pre>
matches "ab", "aB", "c", and "C", even though when matching "C" the first
branch is abandoned before the option setting. This is because the effects of
option settings happen at compile time. There would be some very weird
behaviour otherwise.
</P>
<P>
<b>Note:</b> There are other PCRE-specific options that can be set by the
application when the compile or match functions are called. In some cases the
pattern can contain special leading sequences to override what the application
has set or what has been defaulted. Details are given in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
above.
<a name="subpattern"></a></P>
<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br>
<P>
Subpatterns are delimited by parentheses (round brackets), which can be nested.
Turning part of a pattern into a subpattern does two things:
<br>
<br>
1. It localizes a set of alternatives. For example, the pattern
<pre>
  cat(aract|erpillar|)
</pre>
matches one of the words "cat", "cataract", or "caterpillar". Without the
parentheses, it would match "cataract", "erpillar" or an empty string.
<br>
<br>
2. It sets up the subpattern as a capturing subpattern. This means that, when
the whole pattern matches, that portion of the subject string that matched the
subpattern is passed back to the caller via the <i>ovector</i> argument of
<b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting
from 1) to obtain numbers for the capturing subpatterns.
</P>
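<P>
The first point, localizing a set of alternatives, can be checked with a short
sketch. It uses Python's <b>re</b> module in place of PCRE and tests whole
words with fullmatch(); the word list is chosen for illustration.
</P>

```python
import re

words = ["cat", "cataract", "caterpillar", "erpillar", ""]

# With parentheses, the alternatives apply only to what follows "cat".
grouped = [w for w in words if re.fullmatch(r"cat(aract|erpillar|)", w)]

# Without them, the vertical bars split the whole pattern.
ungrouped = [w for w in words if re.fullmatch(r"cataract|erpillar|", w)]
```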
<P>
For example, if the string "the red king" is matched against the pattern
<pre>
  the ((red|white) (king|queen))
</pre>
the captured substrings are "red king", "red", and "king", and are numbered 1,
2, and 3, respectively.
</P>
<P>
The fact that plain parentheses fulfil two functions is not always helpful.
There are often times when a grouping subpattern is required without a
capturing requirement. If an opening parenthesis is followed by a question mark
and a colon, the subpattern does not do any capturing, and is not counted when
computing the number of any subsequent capturing subpatterns. For example, if
the string "the white queen" is matched against the pattern
<pre>
  the ((?:red|white) (king|queen))
</pre>
the captured substrings are "white queen" and "queen", and are numbered 1 and
2. The maximum number of capturing subpatterns is 65535.
</P>
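<P>
The numbering of captured substrings in the two patterns above can be
reproduced in a short sketch. Python's <b>re</b> module is used instead of
PCRE; its groups() method returns the captures in number order, standing in
for what PCRE hands back through <i>ovector</i>.
</P>

```python
import re

capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.search(
    r"the ((?:red|white) (king|queen))", "the white queen"
)

# Groups are numbered by the position of their opening parenthesis.
captured = capturing.groups()

# The (?:...) group is skipped when the numbers are assigned.
captured_nc = non_capturing.groups()
```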
<P>
As a convenient shorthand, if any option settings are required at the start of
a non-capturing subpattern, the option letters may appear between the "?" and
the ":". Thus the two patterns
<pre>
  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)
</pre>
match exactly the same set of strings. Because alternative branches are tried
from left to right, and options are not reset until the end of the subpattern
is reached, an option setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
</P>
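<P>
The shorthand form can be tried out in a short sketch using Python's <b>re</b>
module, which accepts the (?i:...) spelling. Only that spelling is shown,
because recent Python versions reject an option setting such as (?i) anywhere
other than the start of a pattern, unlike PCRE.
</P>

```python
import re

days = re.compile(r"(?i:saturday|sunday)")

# The option covers every branch of the group.
matches = [bool(days.fullmatch(s)) for s in ("Saturday", "SUNDAY", "monday")]

# The option ends with the group: only "b" is matched caselessly here.
scoped = [
    s for s in ("abc", "aBc", "Abc", "abC") if re.fullmatch(r"a(?i:b)c", s)
]
```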
<br><a name="SEC13" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>
<P>
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
the same numbers for its capturing parentheses. Such a subpattern starts with
(?| and is itself a non-capturing subpattern. For example, consider this
pattern:
<pre>
  (?|(Sat)ur|(Sun))day
</pre>
Because the two alternatives are inside a (?| group, both sets of capturing
parentheses are numbered one. Thus, when the pattern matches, you can look
at captured substring number one, whichever alternative matched. This construct
is useful when you want to capture part, but not all, of one of a number of
alternatives. Inside a (?| group, parentheses are numbered as usual, but the
number is reset at the start of each branch. The numbers of any capturing
buffers that follow the subpattern start after the highest number used in any
branch. The following example is taken from the Perl documentation.
The numbers underneath show in which buffer the captured content will be
stored.
<pre>
  # before  ---------------branch-reset----------- after
  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
  # 1            2         2  3        2     3     4
</pre>
A backreference or a recursive call to a numbered subpattern always refers to
the first one in the pattern with the given number.
</P>
<P>
An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.
</P>
<br><a name="SEC14" href="#TOC1">NAMED SUBPATTERNS</a><br>
<P>
Identifying capturing parentheses by number is simple, but it can be very hard
to keep track of the numbers in complicated regular expressions. Furthermore,
if an expression is modified, the numbers may change. To help with this
difficulty, PCRE supports the naming of subpatterns. This feature was not
added to Perl until release 5.10. Python had the feature earlier, and PCRE
introduced it at release 4.0, using the Python syntax. PCRE now supports both
the Perl and the Python syntax.
</P>
<P>
In PCRE, a subpattern can be named in one of three ways: (?&#60;name&#62;...) or
(?'name'...) as in Perl, or (?P&#60;name&#62;...) as in Python. References to capturing
parentheses from other parts of the pattern, such as
<a href="#backreferences">backreferences,</a>
<a href="#recursion">recursion,</a>
and
<a href="#conditions">conditions,</a>
can be made by name as well as by number.
</P>
<P>
Names consist of up to 32 alphanumeric characters and underscores. Named
capturing parentheses are still allocated numbers as well as names, exactly as
if the names were not present. The PCRE API provides function calls for
extracting the name-to-number translation table from a compiled pattern. There
is also a convenience function for extracting a captured substring by name.
</P>
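<P>
Named capturing can be sketched with the Python spelling of the syntax, which
is one of the three forms listed above. The example uses Python's own
<b>re</b> module rather than the PCRE API, so groupindex and group() here are
Python's counterparts of the name-to-number table and the extract-by-name
convenience function, not PCRE calls. The date pattern and the group names are
made up for illustration.
</P>

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.search("released 2009-04")

# A named group can be read by name...
by_name = (m.group("year"), m.group("month"))

# ...and still has an ordinary number, as if the name were not present.
by_number = (m.group(1), m.group(2))

# Name-to-number table of the compiled pattern.
name_table = dict(date.groupindex)
```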
<P>
By default, a name must be unique within a pattern, but it is possible to relax
this constraint by setting the PCRE_DUPNAMES option at compile time. This can
be useful for patterns where only one instance of the named parentheses can
match. Suppose you want to match the name of a weekday, either as a 3-letter
abbreviation or as the full name, and in both cases you want to extract the
abbreviation. This pattern (ignoring the line breaks) does the job:
<pre>
  (?&#60;DN&#62;Mon|Fri|Sun)(?:day)?|
  (?&#60;DN&#62;Tue)(?:sday)?|
  (?&#60;DN&#62;Wed)(?:nesday)?|
  (?&#60;DN&#62;Thu)(?:rsday)?|
  (?&#60;DN&#62;Sat)(?:urday)?
</pre>
There are five capturing substrings, but only one is …
