/vendor/cegui-0.4.0-custom/src/pcre/doc/html/pcrepattern.html

https://github.com/rameshj03/multitheftauto
  1. <html>
  2. <head>
  3. <title>pcrepattern specification</title>
  4. </head>
  5. <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
  6. <h1>pcrepattern man page</h1>
  7. <p>
  8. Return to the <a href="index.html">PCRE index page</a>.
  9. </p>
  10. <p>
  11. This page is part of the PCRE HTML documentation. It was generated automatically
  12. from the original man page. If there is any nonsense in it, please consult the
  13. man page, in case the conversion went wrong.
  14. <br>
  15. <ul>
  16. <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a>
  17. <li><a name="TOC2" href="#SEC2">BACKSLASH</a>
  18. <li><a name="TOC3" href="#SEC3">CIRCUMFLEX AND DOLLAR</a>
  19. <li><a name="TOC4" href="#SEC4">FULL STOP (PERIOD, DOT)</a>
  20. <li><a name="TOC5" href="#SEC5">MATCHING A SINGLE BYTE</a>
  21. <li><a name="TOC6" href="#SEC6">SQUARE BRACKETS AND CHARACTER CLASSES</a>
  22. <li><a name="TOC7" href="#SEC7">POSIX CHARACTER CLASSES</a>
  23. <li><a name="TOC8" href="#SEC8">VERTICAL BAR</a>
  24. <li><a name="TOC9" href="#SEC9">INTERNAL OPTION SETTING</a>
  25. <li><a name="TOC10" href="#SEC10">SUBPATTERNS</a>
  26. <li><a name="TOC11" href="#SEC11">NAMED SUBPATTERNS</a>
  27. <li><a name="TOC12" href="#SEC12">REPETITION</a>
  28. <li><a name="TOC13" href="#SEC13">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
  29. <li><a name="TOC14" href="#SEC14">BACK REFERENCES</a>
  30. <li><a name="TOC15" href="#SEC15">ASSERTIONS</a>
  31. <li><a name="TOC16" href="#SEC16">CONDITIONAL SUBPATTERNS</a>
  32. <li><a name="TOC17" href="#SEC17">COMMENTS</a>
  33. <li><a name="TOC18" href="#SEC18">RECURSIVE PATTERNS</a>
  34. <li><a name="TOC19" href="#SEC19">SUBPATTERNS AS SUBROUTINES</a>
  35. <li><a name="TOC20" href="#SEC20">CALLOUTS</a>
  36. </ul>
  37. <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
  38. <P>
  39. The syntax and semantics of the regular expressions supported by PCRE are
  40. described below. Regular expressions are also described in the Perl
  41. documentation and in a number of books, some of which have copious examples.
  42. Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly, covers
  43. regular expressions in great detail. This description of PCRE's regular
  44. expressions is intended as reference material.
  45. </P>
  46. <P>
  47. The original operation of PCRE was on strings of one-byte characters. However,
  48. there is now also support for UTF-8 character strings. To use this, you must
  49. build PCRE to include UTF-8 support, and then call <b>pcre_compile()</b> with
  50. the PCRE_UTF8 option. How this affects pattern matching is mentioned in several
  51. places below. There is also a summary of UTF-8 features in the
  52. <a href="pcre.html#utf8support">section on UTF-8 support</a>
  53. in the main
  54. <a href="pcre.html"><b>pcre</b></a>
  55. page.
  56. </P>
  57. <P>
  58. A regular expression is a pattern that is matched against a subject string from
  59. left to right. Most characters stand for themselves in a pattern, and match the
  60. corresponding characters in the subject. As a trivial example, the pattern
  61. <pre>
  62. The quick brown fox
  63. </pre>
  64. matches a portion of a subject string that is identical to itself. The power of
  65. regular expressions comes from the ability to include alternatives and
  66. repetitions in the pattern. These are encoded in the pattern by the use of
  67. <i>metacharacters</i>, which do not stand for themselves but instead are
  68. interpreted in some special way.
  69. </P>
  70. <P>
  71. There are two different sets of metacharacters: those that are recognized
  72. anywhere in the pattern except within square brackets, and those that are
  73. recognized in square brackets. Outside square brackets, the metacharacters are
  74. as follows:
  75. <pre>
  76.   \      general escape character with several uses
  77.   ^      assert start of string (or line, in multiline mode)
  78.   $      assert end of string (or line, in multiline mode)
  79.   .      match any character except newline (by default)
  80.   [      start character class definition
  81.   |      start of alternative branch
  82.   (      start subpattern
  83.   )      end subpattern
  84.   ?      extends the meaning of (
  85.          also 0 or 1 quantifier
  86.          also quantifier minimizer
  87.   *      0 or more quantifier
  88.   +      1 or more quantifier
  89.          also "possessive quantifier"
  90.   {      start min/max quantifier
  91. </pre>
  92. Part of a pattern that is in square brackets is called a "character class". In
  93. a character class the only metacharacters are:
  94. <pre>
  95.   \      general escape character
  96.   ^      negate the class, but only if the first character
  97.   -      indicates character range
  98.   [      POSIX character class (only if followed by POSIX syntax)
  99.   ]      terminates the character class
  100. </pre>
  101. The following sections describe the use of each of the metacharacters.
  102. </P>
  103. <br><a name="SEC2" href="#TOC1">BACKSLASH</a><br>
  104. <P>
  105. The backslash character has several uses. Firstly, if it is followed by a
  106. non-alphanumeric character, it takes away any special meaning that character may
  107. have. This use of backslash as an escape character applies both inside and
  108. outside character classes.
  109. </P>
  110. <P>
  111. For example, if you want to match a * character, you write \* in the pattern.
  112. This escaping action applies whether or not the following character would
  113. otherwise be interpreted as a metacharacter, so it is always safe to precede a
  114. non-alphanumeric with backslash to specify that it stands for itself. In
  115. particular, if you want to match a backslash, you write \\.
  116. </P>
  117. <P>
  118. If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
  119. pattern (other than in a character class) and characters between a # outside
  120. a character class and the next newline character are ignored. An escaping
  121. backslash can be used to include a whitespace or # character as part of the
  122. pattern.
  123. </P>
  124. <P>
  125. If you want to remove the special meaning from a sequence of characters, you
  126. can do so by putting them between \Q and \E. This is different from Perl in
  127. that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in
  128. Perl, $ and @ cause variable interpolation. Note the following examples:
  129. <pre>
  130.   Pattern            PCRE matches   Perl matches
  131.   \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
  132.   \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  133.   \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
  134. </pre>
  135. The \Q...\E sequence is recognized both inside and outside character classes.
  136. <a name="digitsafterbackslash"></a></P>
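<P>
The effect of \Q...\E in PCRE can be pictured as putting a backslash in front of
every non-alphanumeric character in the quoted span. The following Python sketch
is only an illustration of that idea, not PCRE's implementation. The function
name <i>expand_quoted_spans</i> is hypothetical, and the treatment of a \Q with
no closing \E (quote to the end of the pattern) is an assumption.
</P>

```python
def expand_quoted_spans(pattern: str) -> str:
    """Rewrite \\Q...\\E spans as backslash-escaped literals (illustrative only)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\Q", i):
            end = pattern.find("\\E", i + 2)
            # Assumption: an unterminated \Q quotes the rest of the pattern.
            literal = pattern[i + 2:] if end == -1 else pattern[i + 2:end]
            # Escaping a non-alphanumeric character is always safe.
            out.append("".join(ch if ch.isalnum() else "\\" + ch for ch in literal))
            i = len(pattern) if end == -1 else end + 2
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            # Copy any other backslash escape unchanged, as a two-character unit.
            out.append(pattern[i:i + 2])
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


for example in (r"\Qabc$xyz\E", r"\Qabc\$xyz\E", r"\Qabc\E\$\Qxyz\E"):
    print(example, "->", expand_quoted_spans(example))
```

<P>
For the three patterns in the table above, the sketch produces escaped forms
that match the literal strings listed in the "PCRE matches" column.
</P>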
  137. <br><b>
  138. Non-printing characters
  139. </b><br>
  140. <P>
  141. A second use of backslash provides a way of encoding non-printing characters
  142. in patterns in a visible manner. There is no restriction on the appearance of
  143. non-printing characters, apart from the binary zero that terminates a pattern,
  144. but when a pattern is being prepared by text editing, it is usually easier to
  145. use one of the following escape sequences than the binary character it
  146. represents:
  147. <pre>
  148.   \a         alarm, that is, the BEL character (hex 07)
  149.   \cx        "control-x", where x is any character
  150.   \e         escape (hex 1B)
  151.   \f         formfeed (hex 0C)
  152.   \n         newline (hex 0A)
  153.   \r         carriage return (hex 0D)
  154.   \t         tab (hex 09)
  155.   \ddd       character with octal code ddd, or backreference
  156.   \xhh       character with hex code hh
  157.   \x{hhh..}  character with hex code hhh... (UTF-8 mode only)
  158. </pre>
  159. The precise effect of \cx is as follows: if x is a lower case letter, it
  160. is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
  161. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; becomes hex
  162. 7B.
  163. </P>
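<P>
The \cx rule can be checked with a little arithmetic. The following Python
sketch is only an illustration; the function name <i>control_char_code</i> is
hypothetical and is not part of PCRE, and it assumes that x is a single ASCII
character.
</P>

```python
def control_char_code(x: str) -> int:
    """Return the character code that \\cx stands for, following the rule above."""
    if len(x) != 1 or ord(x) > 127:
        raise ValueError("expected a single ASCII character")
    # A lower case letter is first converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Then bit 6 of the character (hex 40) is inverted.
    return ord(x) ^ 0x40


print(hex(control_char_code("z")))  # 0x1a
print(hex(control_char_code("{")))  # 0x3b
print(hex(control_char_code(";")))  # 0x7b
```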
  164. <P>
  165. After \x, from zero to two hexadecimal digits are read (letters can be in
  166. upper or lower case). In UTF-8 mode, any number of hexadecimal digits may
  167. appear between \x{ and }, but the value of the character code must be less
  168. than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters
  169. other than hexadecimal digits appear between \x{ and }, or if there is no
  170. terminating }, this form of escape is not recognized. Instead, the initial
  171. \x will be interpreted as a basic hexadecimal escape, with no following
  172. digits, giving a character whose value is zero.
  173. </P>
  174. <P>
  175. Characters whose value is less than 256 can be defined by either of the two
  176. syntaxes for \x when PCRE is in UTF-8 mode. There is no difference in the
  177. way they are handled. For example, \xdc is exactly the same as \x{dc}.
  178. </P>
  179. <P>
  180. After \0 up to two further octal digits are read. In both cases, if there
  181. are fewer than two digits, just those that are present are used. Thus the
  182. sequence \0\x\07 specifies two binary zeros followed by a BEL character
  183. (code value 7). Make sure you supply two digits after the initial zero if the
  184. pattern character that follows is itself an octal digit.
  185. </P>
  186. <P>
  187. The handling of a backslash followed by a digit other than 0 is complicated.
  188. Outside a character class, PCRE reads it and any following digits as a decimal
  189. number. If the number is less than 10, or if there have been at least that many
  190. previous capturing left parentheses in the expression, the entire sequence is
  191. taken as a <i>back reference</i>. A description of how this works is given
  192. <a href="#backreferences">later,</a>
  193. following the discussion of
  194. <a href="#subpattern">parenthesized subpatterns.</a>
  195. </P>
  196. <P>
  197. Inside a character class, or if the decimal number is greater than 9 and there
  198. have not been that many capturing subpatterns, PCRE re-reads up to three octal
  199. digits following the backslash, and generates a single byte from the least
  200. significant 8 bits of the value. Any subsequent digits stand for themselves.
  201. For example:
  202. <pre>
  203.   \040   is another way of writing a space
  204.   \40    is the same, provided there are fewer than 40 previous capturing subpatterns
  205.   \7     is always a back reference
  206.   \11    might be a back reference, or another way of writing a tab
  207.   \011   is always a tab
  208.   \0113  is a tab followed by the character "3"
  209.   \113   might be a back reference, otherwise the character with octal code 113
  210.   \377   might be a back reference, otherwise the byte consisting entirely of 1 bits
  211.   \81    is either a back reference, or a binary zero followed by the two characters "8" and "1"
  212. </pre>
  213. Note that octal values of 100 or greater must not be introduced by a leading
  214. zero, because no more than three octal digits are ever read.
  215. </P>
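<P>
The decision between a back reference and an octal byte can be summarized in a
few lines. The following Python sketch is a simplified model of the rules
described above, not PCRE's own code. The function name
<i>classify_backslash_digits</i> and its parameters are hypothetical; it takes
the run of digits that follows the backslash (first digit not 0), the number of
capturing left parentheses seen so far, and whether the escape is inside a
character class.
</P>

```python
def classify_backslash_digits(digits: str, groups_so_far: int, in_class: bool = False):
    """Model how a backslash followed by a non-zero digit is interpreted.

    Returns ("backref", number, "") or ("byte", value, remaining_literal_digits).
    """
    if not in_class:
        number = int(digits)  # read all the digits as a decimal number
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits following the backslash.
    value = 0
    used = 0
    while used < len(digits) and used < 3 and digits[used] in "01234567":
        value = value * 8 + int(digits[used])
        used += 1
    # Keep the least significant 8 bits; later digits stand for themselves.
    return ("byte", value & 0xFF, digits[used:])


print(classify_backslash_digits("7", 0))     # ('backref', 7, '')
print(classify_backslash_digits("11", 0))    # ('byte', 9, '')
print(classify_backslash_digits("11", 11))   # ('backref', 11, '')
print(classify_backslash_digits("81", 0))    # ('byte', 0, '81')
```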
  216. <P>
  217. All the sequences that define a single byte value or a single UTF-8 character
  218. (in UTF-8 mode) can be used both inside and outside character classes. In
  219. addition, inside a character class, the sequence \b is interpreted as the
  220. backspace character (hex 08), and the sequence \X is interpreted as the
  221. character "X". Outside a character class, these sequences have different
  222. meanings
  223. <a href="#uniextseq">(see below).</a>
  224. </P>
  225. <br><b>
  226. Generic character types
  227. </b><br>
  228. <P>
  229. The third use of backslash is for specifying generic character types. The
  230. following are always recognized:
  231. <pre>
  232. \d any decimal digit
  233. \D any character that is not a decimal digit
  234. \s any whitespace character
  235. \S any character that is not a whitespace character
  236. \w any "word" character
  237. \W any "non-word" character
  238. </pre>
  239. Each pair of escape sequences partitions the complete set of characters into
  240. two disjoint sets. Any given character matches one, and only one, of each pair.
  241. </P>
  242. <P>
  243. These character type sequences can appear both inside and outside character
  244. classes. They each match one character of the appropriate type. If the current
  245. matching point is at the end of the subject string, all of them fail, since
  246. there is no character to match.
  247. </P>
  248. <P>
  249. For compatibility with Perl, \s does not match the VT character (code 11).
  250. This makes it different from the POSIX "space" class. The \s characters
  251. are HT (9), LF (10), FF (12), CR (13), and space (32).
  252. </P>
  253. <P>
  254. A "word" character is an underscore or any character less than 256 that is a
  255. letter or digit. The definition of letters and digits is controlled by PCRE's
  256. low-valued character tables, and may vary if locale-specific matching is taking
  257. place (see
  258. <a href="pcreapi.html#localesupport">"Locale support"</a>
  259. in the
  260. <a href="pcreapi.html"><b>pcreapi</b></a>
  261. page). For example, in the "fr_FR" (French) locale, some character codes
  262. greater than 128 are used for accented letters, and these are matched by \w.
  263. </P>
  264. <P>
  265. In UTF-8 mode, characters with values greater than 128 never match \d, \s, or
  266. \w, and always match \D, \S, and \W. This is true even when Unicode
  267. character property support is available.
  268. <a name="uniextseq"></a></P>
  269. <br><b>
  270. Unicode character properties
  271. </b><br>
  272. <P>
  273. When PCRE is built with Unicode character property support, three additional
  274. escape sequences to match generic character types are available when UTF-8 mode
  275. is selected. They are:
  276. <pre>
  277. \p{<i>xx</i>} a character with the <i>xx</i> property
  278. \P{<i>xx</i>} a character without the <i>xx</i> property
  279. \X an extended Unicode sequence
  280. </pre>
  281. The property names represented by <i>xx</i> above are limited to the
  282. Unicode general category properties. Each character has exactly one such
  283. property, specified by a two-letter abbreviation. For compatibility with Perl,
  284. negation can be specified by including a circumflex between the opening brace
  285. and the property name. For example, \p{^Lu} is the same as \P{Lu}.
  286. </P>
  287. <P>
  288. If only one letter is specified with \p or \P, it includes all the properties
  289. that start with that letter. In this case, in the absence of negation, the
  290. curly brackets in the escape sequence are optional; these two examples have
  291. the same effect:
  292. <pre>
  293. \p{L}
  294. \pL
  295. </pre>
  296. The following property codes are supported:
  297. <pre>
  298. C Other
  299. Cc Control
  300. Cf Format
  301. Cn Unassigned
  302. Co Private use
  303. Cs Surrogate
  304. L Letter
  305. Ll Lower case letter
  306. Lm Modifier letter
  307. Lo Other letter
  308. Lt Title case letter
  309. Lu Upper case letter
  310. M Mark
  311. Mc Spacing mark
  312. Me Enclosing mark
  313. Mn Non-spacing mark
  314. N Number
  315. Nd Decimal number
  316. Nl Letter number
  317. No Other number
  318. P Punctuation
  319. Pc Connector punctuation
  320. Pd Dash punctuation
  321. Pe Close punctuation
  322. Pf Final punctuation
  323. Pi Initial punctuation
  324. Po Other punctuation
  325. Ps Open punctuation
  326. S Symbol
  327. Sc Currency symbol
  328. Sk Modifier symbol
  329. Sm Mathematical symbol
  330. So Other symbol
  331. Z Separator
  332. Zl Line separator
  333. Zp Paragraph separator
  334. Zs Space separator
  335. </pre>
  336. Extended properties such as "Greek" or "InMusicalSymbols" are not supported by
  337. PCRE.
  338. </P>
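<P>
The relationship between one-letter and two-letter property codes can be
explored with Python's standard <i>unicodedata</i> module, which reports the
same two-letter general category abbreviations. This is only an approximation
of \p and \P: the helper name <i>has_property</i> is hypothetical, and Python's
Unicode tables are not the ones built into PCRE, so individual characters may
be classified differently.
</P>

```python
import unicodedata


def has_property(ch: str, prop: str) -> bool:
    """Approximate \\p{prop} for a one- or two-letter general category code."""
    # A one-letter code such as "L" covers every category that starts with it.
    return unicodedata.category(ch).startswith(prop)


print(has_property("A", "Lu"))  # True: upper case letter
print(has_property("a", "Lu"))  # False: lower case letter
print(has_property("a", "L"))   # True: any letter
print(has_property("7", "Nd"))  # True: decimal number
```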
  339. <P>
  340. Specifying caseless matching does not affect these escape sequences. For
  341. example, \p{Lu} always matches only upper case letters.
  342. </P>
  343. <P>
  344. The \X escape matches any number of Unicode characters that form an extended
  345. Unicode sequence. \X is equivalent to
  346. <pre>
  347. (?&#62;\PM\pM*)
  348. </pre>
  349. That is, it matches a character without the "mark" property, followed by zero
  350. or more characters with the "mark" property, and treats the sequence as an
  351. atomic group
  352. <a href="#atomicgroup">(see below).</a>
  353. Characters with the "mark" property are typically accents that affect the
  354. preceding character.
  355. </P>
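<P>
The behaviour of \X can be modelled directly from its definition. The following
Python sketch is illustrative only: the function names are hypothetical, and it
relies on Python's <i>unicodedata</i> tables rather than PCRE's. It returns the
offset just past the extended sequence that starts at a given position, or
<i>None</i> if the character there has the "mark" property, in which case \PM
cannot match.
</P>

```python
import unicodedata


def is_mark(ch: str) -> bool:
    # General categories Mc, Me and Mn all start with "M".
    return unicodedata.category(ch).startswith("M")


def match_extended_sequence(subject: str, pos: int):
    """Model (?>\\PM\\pM*): one non-mark character, then all following marks."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None
    end = pos + 1
    # Atomic group: every following mark is taken and none is given back.
    while end < len(subject) and is_mark(subject[end]):
        end += 1
    return end


subject = "e\u0301a"  # "e", COMBINING ACUTE ACCENT, "a"
print(match_extended_sequence(subject, 0))  # 2
print(match_extended_sequence(subject, 2))  # 3
print(match_extended_sequence(subject, 1))  # None
```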
  356. <P>
  357. Matching characters by Unicode property is not fast, because PCRE has to search
  358. a structure that contains data for over fifteen thousand characters. That is
  359. why the traditional escape sequences such as \d and \w do not use Unicode
  360. properties in PCRE.
  361. <a name="smallassertions"></a></P>
  362. <br><b>
  363. Simple assertions
  364. </b><br>
  365. <P>
  366. The fourth use of backslash is for certain simple assertions. An assertion
  367. specifies a condition that has to be met at a particular point in a match,
  368. without consuming any characters from the subject string. The use of
  369. subpatterns for more complicated assertions is described
  370. <a href="#bigassertions">below.</a>
  371. The backslashed
  372. assertions are:
  373. <pre>
  374. \b matches at a word boundary
  375. \B matches when not at a word boundary
  376. \A matches at start of subject
  377. \Z matches at end of subject or before newline at end
  378. \z matches at end of subject
  379. \G matches at first matching position in subject
  380. </pre>
  381. These assertions may not appear in character classes (but note that \b has a
  382. different meaning, namely the backspace character, inside a character class).
  383. </P>
  384. <P>
  385. A word boundary is a position in the subject string where the current character
  386. and the previous character do not both match \w or \W (i.e. one matches
  387. \w and the other matches \W), or the start or end of the string if the
  388. first or last character matches \w, respectively.
  389. </P>
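<P>
The word boundary definition can be expressed as a comparison between the
character before the current position and the character at it. The following
Python sketch is illustrative only; the function names are hypothetical, and it
assumes the default character tables, in which a "word" character is an ASCII
letter, an ASCII digit, or an underscore.
</P>

```python
def is_word_char(ch: str) -> bool:
    # Assumption: default tables, so only ASCII letters, digits and underscore.
    return ch == "_" or (ch.isascii() and ch.isalnum())


def is_word_boundary(subject: str, pos: int) -> bool:
    """True where \\b would match; \\B matches exactly where this is False."""
    # Positions outside the subject count as non-word, so the start or end of
    # the string is a boundary only if the first or last character is a word
    # character.
    prev_is_word = pos > 0 and is_word_char(subject[pos - 1])
    curr_is_word = pos < len(subject) and is_word_char(subject[pos])
    return prev_is_word != curr_is_word


subject = "ab cd"
print([pos for pos in range(len(subject) + 1) if is_word_boundary(subject, pos)])
# [0, 2, 3, 5]
```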
  390. <P>
  391. The \A, \Z, and \z assertions differ from the traditional circumflex and
  392. dollar (described in the next section) in that they only ever match at the very
  393. start and end of the subject string, whatever options are set. Thus, they are
  394. independent of multiline mode. These three assertions are not affected by the
  395. PCRE_NOTBOL or PCRE_NOTEOL options, which affect only the behaviour of the
  396. circumflex and dollar metacharacters. However, if the <i>startoffset</i>
  397. argument of <b>pcre_exec()</b> is non-zero, indicating that matching is to start
  398. at a point other than the beginning of the subject, \A can never match. The
  399. difference between \Z and \z is that \Z matches before a newline that is the
  400. last character of the string as well as at the end of the string, whereas \z
  401. matches only at the end.
  402. </P>
  403. <P>
  404. The \G assertion is true only when the current matching position is at the
  405. start point of the match, as specified by the <i>startoffset</i> argument of
  406. <b>pcre_exec()</b>. It differs from \A when the value of <i>startoffset</i> is
  407. non-zero. By calling <b>pcre_exec()</b> multiple times with appropriate
  408. arguments, you can mimic Perl's /g option, and it is in this kind of
  409. implementation where \G can be useful.
  410. </P>
  411. <P>
  412. Note, however, that PCRE's interpretation of \G, as the start of the current
  413. match, is subtly different from Perl's, which defines it as the end of the
  414. previous match. In Perl, these can be different when the previously matched
  415. string was empty. Because PCRE does just one match at a time, it cannot
  416. reproduce this behaviour.
  417. </P>
  418. <P>
  419. If all the alternatives of a pattern begin with \G, the expression is anchored
  420. to the starting match position, and the "anchored" flag is set in the compiled
  421. regular expression.
  422. </P>
  423. <br><a name="SEC3" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
  424. <P>
  425. Outside a character class, in the default matching mode, the circumflex
  426. character is an assertion that is true only if the current matching point is
  427. at the start of the subject string. If the <i>startoffset</i> argument of
  428. <b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE
  429. option is unset. Inside a character class, circumflex has an entirely different
  430. meaning
  431. <a href="#characterclass">(see below).</a>
  432. </P>
  433. <P>
  434. Circumflex need not be the first character of the pattern if a number of
  435. alternatives are involved, but it should be the first thing in each alternative
  436. in which it appears if the pattern is ever to match that branch. If all
  437. possible alternatives start with a circumflex, that is, if the pattern is
  438. constrained to match only at the start of the subject, it is said to be an
  439. "anchored" pattern. (There are also other constructs that can cause a pattern
  440. to be anchored.)
  441. </P>
  442. <P>
  443. A dollar character is an assertion that is true only if the current matching
  444. point is at the end of the subject string, or immediately before a newline
  445. character that is the last character in the string (by default). Dollar need
  446. not be the last character of the pattern if a number of alternatives are
  447. involved, but it should be the last item in any branch in which it appears.
  448. Dollar has no special meaning in a character class.
  449. </P>
  450. <P>
  451. The meaning of dollar can be changed so that it matches only at the very end of
  452. the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This
  453. does not affect the \Z assertion.
  454. </P>
  455. <P>
  456. The meanings of the circumflex and dollar characters are changed if the
  457. PCRE_MULTILINE option is set. When this is the case, they match immediately
  458. after and immediately before an internal newline character, respectively, in
  459. addition to matching at the start and end of the subject string. For example,
  460. the pattern /^abc$/ matches the subject string "def\nabc" (where \n
  461. represents a newline character) in multiline mode, but not otherwise.
  462. Consequently, patterns that are anchored in single line mode because all
  463. branches start with ^ are not anchored in multiline mode, and a match for
  464. circumflex is possible when the <i>startoffset</i> argument of <b>pcre_exec()</b>
  465. is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
  466. set.
  467. </P>
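<P>
The interaction between circumflex, dollar, PCRE_MULTILINE, and
PCRE_DOLLAR_ENDONLY can be summarized as two position tests. The following
Python sketch is a simplified model, not PCRE's implementation: the function
and parameter names are hypothetical, and it does not model a non-zero
<i>startoffset</i> or the PCRE_NOTBOL and PCRE_NOTEOL options.
</P>

```python
def circumflex_matches(subject: str, pos: int, multiline: bool = False) -> bool:
    """Model ^ at offset pos in subject."""
    if pos == 0:
        return True
    # In multiline mode, ^ also matches immediately after a newline.
    return multiline and subject[pos - 1] == "\n"


def dollar_matches(subject: str, pos: int, multiline: bool = False,
                   dollar_endonly: bool = False) -> bool:
    """Model $ at offset pos in subject."""
    if pos == len(subject):
        return True
    if multiline:
        # Immediately before any newline; dollar_endonly is ignored here.
        return subject[pos] == "\n"
    if dollar_endonly:
        return False
    # Default: also immediately before a newline that is the last character.
    return pos == len(subject) - 1 and subject[pos] == "\n"


subject = "def\nabc"
print(circumflex_matches(subject, 4))                  # False
print(circumflex_matches(subject, 4, multiline=True))  # True
print(dollar_matches("abc\n", 3))                      # True
print(dollar_matches("abc\n", 3, dollar_endonly=True)) # False
```

<P>
With multiline set, circumflex matches at offset 4 and dollar at offset 7 of
"def\nabc", which is why /^abc$/ matches that subject only in multiline mode.
</P>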
  468. <P>
  469. Note that the sequences \A, \Z, and \z can be used to match the start and
  470. end of the subject in both modes, and if all branches of a pattern start with
  471. \A it is always anchored, whether PCRE_MULTILINE is set or not.
  472. </P>
  473. <br><a name="SEC4" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br>
  474. <P>
  475. Outside a character class, a dot in the pattern matches any one character in
  476. the subject, including a non-printing character, but not (by default) newline.
  477. In UTF-8 mode, a dot matches any UTF-8 character, which might be more than one
  478. byte long, except (by default) newline. If the PCRE_DOTALL option is set,
  479. dots match newlines as well. The handling of dot is entirely independent of the
  480. handling of circumflex and dollar, the only relationship being that they both
  481. involve newline characters. Dot has no special meaning in a character class.
  482. </P>
  483. <br><a name="SEC5" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
  484. <P>
  485. Outside a character class, the escape sequence \C matches any one byte, both
  486. in and out of UTF-8 mode. Unlike a dot, it can match a newline. The feature is
  487. provided in Perl in order to match individual bytes in UTF-8 mode. Because it
  488. breaks up UTF-8 characters into individual bytes, what remains in the string
  489. may be a malformed UTF-8 string. For this reason, the \C escape sequence is
  490. best avoided.
  491. </P>
  492. <P>
  493. PCRE does not allow \C to appear in lookbehind assertions
  494. <a href="#lookbehind">(described below),</a>
  495. because in UTF-8 mode this would make it impossible to calculate the length of
  496. the lookbehind.
  497. <a name="characterclass"></a></P>
  498. <br><a name="SEC6" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
  499. <P>
  500. An opening square bracket introduces a character class, terminated by a closing
  501. square bracket. A closing square bracket on its own is not special. If a
  502. closing square bracket is required as a member of the class, it should be the
  503. first data character in the class (after an initial circumflex, if present) or
  504. escaped with a backslash.
  505. </P>
  506. <P>
  507. A character class matches a single character in the subject. In UTF-8 mode, the
  508. character may occupy more than one byte. A matched character must be in the set
  509. of characters defined by the class, unless the first character in the class
  510. definition is a circumflex, in which case the subject character must not be in
  511. the set defined by the class. If a circumflex is actually required as a member
  512. of the class, ensure it is not the first character, or escape it with a
  513. backslash.
  514. </P>
  515. <P>
  516. For example, the character class [aeiou] matches any lower case vowel, while
  517. [^aeiou] matches any character that is not a lower case vowel. Note that a
  518. circumflex is just a convenient notation for specifying the characters that
  519. are in the class by enumerating those that are not. A class that starts with a
  520. circumflex is not an assertion: it still consumes a character from the subject
  521. string, and therefore it fails if the current pointer is at the end of the
  522. string.
  523. </P>
  524. <P>
  525. In UTF-8 mode, characters with values greater than 255 can be included in a
  526. class as a literal string of bytes, or by using the \x{ escaping mechanism.
  527. </P>
  528. <P>
  529. When caseless matching is set, any letters in a class represent both their
  530. upper case and lower case versions, so for example, a caseless [aeiou] matches
  531. "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
  532. caseful version would. When running in UTF-8 mode, PCRE supports the concept of
  533. case for characters with values greater than 128 only when it is compiled with
  534. Unicode property support.
  535. </P>
  536. <P>
  537. The newline character is never treated in any special way in character classes,
  538. whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class
  539. such as [^a] will always match a newline.
  540. </P>
  541. <P>
  542. The minus (hyphen) character can be used to specify a range of characters in a
  543. character class. For example, [d-m] matches any letter between d and m,
  544. inclusive. If a minus character is required in a class, it must be escaped with
  545. a backslash or appear in a position where it cannot be interpreted as
  546. indicating a range, typically as the first or last character in the class.
  547. </P>
  548. <P>
  549. It is not possible to have the literal character "]" as the end character of a
  550. range. A pattern such as [W-]46] is interpreted as a class of two characters
  551. ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
  552. "-46]". However, if the "]" is escaped with a backslash it is interpreted as
  553. the end of range, so [W-\]46] is interpreted as a class containing a range
  554. followed by two other characters. The octal or hexadecimal representation of
  555. "]" can also be used to end a range.
  556. </P>
  557. <P>
  558. Ranges operate in the collating sequence of character values. They can also be
  559. used for characters specified numerically, for example [\000-\037]. In UTF-8
  560. mode, ranges can include characters whose values are greater than 255, for
  561. example [\x{100}-\x{2ff}].
  562. </P>
  563. <P>
  564. If a range that includes letters is used when caseless matching is set, it
  565. matches the letters in either case. For example, [W-c] is equivalent to
  566. [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character
  567. tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches accented E
  568. characters in both cases. In UTF-8 mode, PCRE supports the concept of case for
  569. characters with values greater than 128 only when it is compiled with Unicode
  570. property support.
  571. </P>
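<P>
The [W-c] equivalence can be verified by enumerating the range and adding both
cases of every letter. The following Python sketch is illustrative only; the
helper name <i>caseless_members</i> is hypothetical, and it considers ASCII
characters only.
</P>

```python
def caseless_members(chars) -> set:
    """Return the characters a caseless class with these members would match."""
    members = set()
    for ch in chars:
        # Non-letters are unchanged by lower() and upper().
        members.add(ch.lower())
        members.add(ch.upper())
    return members


# Every character from "W" to "c" in the collating sequence of character values.
range_w_to_c = [chr(code) for code in range(ord("W"), ord("c") + 1)]
# The members of the explicit class [][\\^_`wxyzabc].
explicit_class = list("][\\^_`wxyzabc")

print("".join(range_w_to_c))
print(caseless_members(range_w_to_c) == caseless_members(explicit_class))  # True
```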
  572. <P>
  573. The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
  574. in a character class, and add the characters that they match to the class. For
  575. example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
  576. conveniently be used with the upper case character types to specify a more
  577. restricted set of characters than the matching lower case type. For example,
  578. the class [^\W_] matches any letter or digit, but not underscore.
  579. </P>
  580. <P>
  581. The only metacharacters that are recognized in character classes are backslash,
  582. hyphen (only where it can be interpreted as specifying a range), circumflex
  583. (only at the start), opening square bracket (only when it can be interpreted as
  584. introducing a POSIX class name - see the next section), and the terminating
  585. closing square bracket. However, escaping other non-alphanumeric characters
  586. does no harm.
  587. </P>
  588. <br><a name="SEC7" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
  589. <P>
  590. Perl supports the POSIX notation for character classes. This uses names
  591. enclosed by [: and :] within the enclosing square brackets. PCRE also supports
  592. this notation. For example,
  593. <pre>
  594. [01[:alpha:]%]
  595. </pre>
  596. matches "0", "1", any alphabetic character, or "%". The supported class names
  597. are
  598. <pre>
  599. alnum letters and digits
  600. alpha letters
  601. ascii character codes 0 - 127
  602. blank space or tab only
  603. cntrl control characters
  604. digit decimal digits (same as \d)
  605. graph printing characters, excluding space
  606. lower lower case letters
  607. print printing characters, including space
  608. punct printing characters, excluding letters and digits
  609. space white space (not quite the same as \s)
  610. upper upper case letters
  611. word "word" characters (same as \w)
  612. xdigit hexadecimal digits
  613. </pre>
  614. The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
  615. space (32). Notice that this list includes the VT character (code 11). This
  616. makes "space" different to \s, which does not include VT (for Perl
  617. compatibility).
  618. </P>
  619. <P>
  620. The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
  621. 5.8. Another Perl extension is negation, which is indicated by a ^ character
  622. after the colon. For example,
  623. <pre>
  624. [12[:^digit:]]
  625. </pre>
  626. matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
  627. syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
  628. supported, and an error is given if they are encountered.
  629. </P>
  630. <P>
  631. In UTF-8 mode, characters with values greater than 127 do not match any of
  632. the POSIX character classes.
  633. </P>
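<P>
As a rough illustration only: Python's <b>re</b> module does not accept the
[:name:] notation, so the sketch below expands a few class names into explicit
ASCII ranges before compiling. The table POSIX_ASCII and the helper
expand_posix() are hypothetical names invented for this example, cover only a
handful of classes, and do not handle the ^ negation form.
</P>

```python
import re

# Hypothetical lookup table: ASCII equivalents of a few POSIX class names,
# written as the text that would sit inside an enclosing [...].
POSIX_ASCII = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "xdigit": "0-9A-Fa-f",
    "space": "\\t\\n\\x0b\\x0c\\r ",  # HT, LF, VT, FF, CR, space
}

def expand_posix(pattern):
    """Replace each [:name:] in `pattern` with its ASCII equivalent."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

expanded = expand_posix("[01[:alpha:]%]")  # "[01a-zA-Z%]"
matched = [c for c in "0a%2 " if re.fullmatch(expanded, c)]

# The "space" class includes VT (code 11), as the list above states.
vt_is_space = re.fullmatch(expand_posix("[[:space:]]"), "\x0b") is not None
```

<P>
An unknown class name raises KeyError in this sketch, which loosely mirrors
the way unsupported names are rejected rather than silently ignored.
</P>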
  634. <br><a name="SEC8" href="#TOC1">VERTICAL BAR</a><br>
  635. <P>
  636. Vertical bar characters are used to separate alternative patterns. For example,
  637. the pattern
  638. <pre>
  639. gilbert|sullivan
  640. </pre>
  641. matches either "gilbert" or "sullivan". Any number of alternatives may appear,
  642. and an empty alternative is permitted (matching the empty string).
  643. The matching process tries each alternative in turn, from left to right,
  644. and the first one that succeeds is used. If the alternatives are within a
  645. subpattern
  646. <a href="#subpattern">(defined below),</a>
  647. "succeeds" means matching the rest of the main pattern as well as the
  648. alternative in the subpattern.
  649. </P>
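<P>
The left-to-right rule can be seen in a short sketch. It uses Python's
<b>re</b> module, which is assumed here to behave like PCRE for plain
alternation; the subject strings are made up for the example.
</P>

```python
import re

# Either alternative can match the whole subject.
either = [bool(re.fullmatch(r"gilbert|sullivan", w))
          for w in ("gilbert", "sullivan", "gilbertson")]

# Alternatives are tried left to right, so the first that succeeds is used,
# even when a later one would match more text.
first = re.match(r"cat|category", "category").group()

# Inside a subpattern, "succeeds" includes the rest of the main pattern:
# "cat" is tried first, but the following "$" fails, so "category" is used.
whole = re.match(r"(cat|category)$", "category").group(1)
```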
  650. <br><a name="SEC9" href="#TOC1">INTERNAL OPTION SETTING</a><br>
  651. <P>
  652. The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
  653. PCRE_EXTENDED options can be changed from within the pattern by a sequence of
  654. Perl option letters enclosed between "(?" and ")". The option letters are
  655. <pre>
  656. i for PCRE_CASELESS
  657. m for PCRE_MULTILINE
  658. s for PCRE_DOTALL
  659. x for PCRE_EXTENDED
  660. </pre>
  661. For example, (?im) sets caseless, multiline matching. It is also possible to
  662. unset these options by preceding the letter with a hyphen, and a combined
  663. setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
  664. PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
  665. permitted. If a letter appears both before and after the hyphen, the option is
  666. unset.
  667. </P>
  668. <P>
  669. When an option change occurs at top level (that is, not inside subpattern
  670. parentheses), the change applies to the remainder of the pattern that follows.
  671. If the change is placed right at the start of a pattern, PCRE extracts it into
  672. the global options (and it will therefore show up in data extracted by the
  673. <b>pcre_fullinfo()</b> function).
  674. </P>
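<P>
A minimal sketch of a leading option setting, using Python's <b>re</b> module
rather than the PCRE API. The <b>flags</b> attribute of a compiled Python
pattern is used as a loose analogue of the data returned by
<b>pcre_fullinfo()</b>; that correspondence is an assumption of this example.
</P>

```python
import re

# (?im) at the very start sets caseless, multiline matching for the pattern.
compiled = re.compile(r"(?im)^abc$")

# The inline letters show up in the compiled pattern's global flags.
has_caseless = bool(compiled.flags & re.IGNORECASE)
has_multiline = bool(compiled.flags & re.MULTILINE)

# Both lines that spell "abc" in any case are found; "abd" is not.
found = compiled.findall("ABC\nabc\nabd")
```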
  675. <P>
  676. An option change within a subpattern affects only that part of the current
  677. pattern that follows it, so
  678. <pre>
  679. (a(?i)b)c
  680. </pre>
  681. matches "abc" and "aBc" and no other strings (assuming PCRE_CASELESS is not used).
  682. By this means, options can be made to have different settings in different
  683. parts of the pattern. Any changes made in one alternative do carry on
  684. into subsequent branches within the same subpattern. For example,
  685. <pre>
  686. (a(?i)b|c)
  687. </pre>
  688. matches "ab", "aB", "c", and "C", even though when matching "C" the first
  689. branch is abandoned before the option setting. This is because the effects of
  690. option settings happen at compile time. There would be some very weird
  691. behaviour otherwise.
  692. </P>
  693. <P>
  694. The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the
  695. same way as the Perl-compatible options by using the characters U and X
  696. respectively. The (?X) flag setting is special in that it must always occur
  697. earlier in the pattern than any of the additional features it turns on, even
  698. when it is at top level. It is best to put it at the start.
  699. <a name="subpattern"></a></P>
  700. <br><a name="SEC10" href="#TOC1">SUBPATTERNS</a><br>
  701. <P>
  702. Subpatterns are delimited by parentheses (round brackets), which can be nested.
  703. Turning part of a pattern into a subpattern does two things:
  704. <br>
  705. <br>
  706. 1. It localizes a set of alternatives. For example, the pattern
  707. <pre>
  708. cat(aract|erpillar|)
  709. </pre>
  710. matches one of the words "cat", "cataract", or "caterpillar". Without the
  711. parentheses, it would match "cataract", "erpillar" or the empty string.
  712. <br>
  713. <br>
  714. 2. It sets up the subpattern as a capturing subpattern. This means that, when
  715. the whole pattern matches, that portion of the subject string that matched the
  716. subpattern is passed back to the caller via the <i>ovector</i> argument of
  717. <b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting
  718. from 1) to obtain numbers for the capturing subpatterns.
  719. </P>
  720. <P>
  721. For example, if the string "the red king" is matched against the pattern
  722. <pre>
  723. the ((red|white) (king|queen))
  724. </pre>
  725. the captured substrings are "red king", "red", and "king", and are numbered 1,
  726. 2, and 3, respectively.
  727. </P>
  728. <P>
  729. The fact that plain parentheses fulfil two functions is not always helpful.
  730. There are often times when a grouping subpattern is required without a
  731. capturing requirement. If an opening parenthesis is followed by a question mark
  732. and a colon, the subpattern does not do any capturing, and is not counted when
  733. computing the number of any subsequent capturing subpatterns. For example, if
  734. the string "the white queen" is matched against the pattern
  735. <pre>
  736. the ((?:red|white) (king|queen))
  737. </pre>
  738. the captured substrings are "white queen" and "queen", and are numbered 1 and
  739. 2. The maximum number of capturing subpatterns is 65535, and the maximum depth
  740. of nesting of all subpatterns, both capturing and non-capturing, is 200.
  741. </P>
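<P>
The numbering of the two patterns above can be checked with a short sketch.
It uses Python's <b>re</b> module, which is assumed to count capturing
parentheses in the same way; the <b>groups()</b> call stands in for reading
the <i>ovector</i>.
</P>

```python
import re

# Three capturing groups, numbered by their opening parentheses.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
numbered = capturing.groups()

# (?: ... ) groups without capturing, so later groups move down by one.
grouping_only = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
renumbered = grouping_only.groups()
```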
  742. <P>
  743. As a convenient shorthand, if any option settings are required at the start of
  744. a non-capturing subpattern, the option letters may appear between the "?" and
  745. the ":". Thus the two patterns
  746. <pre>
  747. (?i:saturday|sunday)
  748. (?:(?i)saturday|sunday)
  749. </pre>
  750. match exactly the same set of strings. Because alternative branches are tried
  751. from left to right, and options are not reset until the end of the subpattern
  752. is reached, an option setting in one branch does affect subsequent branches, so
  753. the above patterns match "SUNDAY" as well as "Saturday".
  754. </P>
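<P>
A sketch of the shorthand form, again in Python's <b>re</b> module. Only the
(?i:...) spelling is shown, because recent Python versions reject an option
setting such as (?i) that is not at the start of the pattern; that is a
difference from PCRE, not a property of it.
</P>

```python
import re

# The option applies to every branch inside the group.
days = re.compile(r"(?i:saturday|sunday)")
hits = [d for d in ("Saturday", "SUNDAY", "monday") if days.fullmatch(d)]

# The option ends with the group: "urday" is still matched casefully.
scoped = re.compile(r"(?i:sat)urday")
inside_only = [s for s in ("SATurday", "SATURDAY") if scoped.fullmatch(s)]
```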
  755. <br><a name="SEC11" href="#TOC1">NAMED SUBPATTERNS</a><br>
  756. <P>
  757. Identifying capturing parentheses by number is simple, but it can be very hard
  758. to keep track of the numbers in complicated regular expressions. Furthermore,
  759. if an expression is modified, the numbers may change. To help with this
  760. difficulty, PCRE supports the naming of subpatterns, something that Perl does
  761. not provide. The Python syntax (?P&#60;name&#62;...) is used. Names consist of
  762. alphanumeric characters and underscores, and must be unique within a pattern.
  763. </P>
  764. <P>
  765. Named capturing parentheses are still allocated numbers as well as names. The
  766. PCRE API provides function calls for extracting the name-to-number translation
  767. table from a compiled pattern. There is also a convenience function for
  768. extracting a captured substring by name. For further details see the
  769. <a href="pcreapi.html"><b>pcreapi</b></a>
  770. documentation.
  771. </P>
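<P>
Since the syntax is borrowed from Python, a sketch in Python's <b>re</b>
module may help. The <b>groupindex</b> attribute plays the part of the
name-to-number table here, and the date pattern and group names are invented
for the example; the PCRE calls themselves are described in the
<b>pcreapi</b> documentation.
</P>

```python
import re

# Hypothetical pattern with two named subpatterns.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups are still allocated numbers as well as names.
name_to_number = dict(date.groupindex)

m = date.match("2004-09")
by_name = m.group("year")
by_number = m.group(1)
```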
  772. <br><a name="SEC12" href="#TOC1">REPETITION</a><br>
  773. <P>
  774. Repetition is specified by quantifiers, which can follow any of the following
  775. items:
  776. <pre>
  777. a literal data character
  778. the . metacharacter
  779. the \C escape sequence
  780. the \X escape sequence (in UTF-8 mode with Unicode properties)
  781. an escape such as \d that matches a single character
  782. a character class
  783. a back reference (see next section)
  784. a parenthesized subpattern (unless it is an assertion)
  785. </pre>
  786. The general repetition quantifier specifies a minimum and maximum number of
  787. permitted matches, by giving the two numbers in curly brackets (braces),
  788. separated by a comma. The numbers must be less than 65536, and the first must
  789. be less than or equal to the second. For example:
  790. <pre>
  791. z{2,4}
  792. </pre>
  793. matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
  794. character. If the second number is omitted, but the comma is present, there is
  795. no upper limit; if the second number and the comma are both omitted, the
  796. quantifier specifies an exact number of required matches. Thus
  797. <pre>
  798. [aeiou]{3,}
  799. </pre>
  800. matches at least 3 successive vowels, but may match many more, while
  801. <pre>
  802. \d{8}
  803. </pre>
  804. matches exactly 8 digits. An opening curly bracket that appears in a position
  805. where a quantifier is not allowed, or one that does not match the syntax of a
  806. quantifier, is taken as a literal character. For example, {,6} is not a
  807. quantifier, but a literal string of four characters.
  808. </P>
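<P>
The three quantifier examples above can be exercised as follows. The sketch
uses Python's <b>re</b> module, and the helper accepts() is a hypothetical
name. One known difference is deliberately avoided: Python treats {,6} as a
quantifier meaning {0,6}, whereas the text above describes it as a literal.
</P>

```python
import re

def accepts(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Minimum 2, maximum 4.
bounded = accepts(r"z{2,4}", ["z", "zz", "zzz", "zzzz", "zzzzz"])

# Minimum 3, no upper limit.
open_ended = accepts(r"[aeiou]{3,}", ["ae", "aei", "aeiou"])

# Exactly 8.
exact = accepts(r"\d{8}", ["1234567", "12345678", "123456789"])
```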
  809. <P>
  810. In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual
  811. bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of
  812. which is represented by a two-byte sequence. Similarly, when Unicode property
  813. support is available, \X{3} matches three Unicode extended sequences, each of
  814. which may be several bytes long (and they may be of different lengths).
  815. </P>
  816. <P>
  817. The quantifier {0} is permitted, causing the expression to behave as if the
  818. previous item and the quantifier were not present.
  819. </P>
  820. <P>
  821. For convenience (and historical compatibility) the three most common
  822. quantifiers have single-character abbreviations:
  823. <pre>
  824. * is equivalent to {0,}
  825. + is equivalent to {1,}
  826. ? is equivalent to {0,1}
  827. </pre>
  828. It is possible to construct infinite loops by following a subpattern that can
  829. match no characters with a quantifier that has no upper limit, for example:
  830. <pre>
  831. (a?)*
  832. </pre>
  833. Earlier versions of Perl and PCRE used to give an error at compile time for
  834. such patterns. However, because there are cases where this can be useful, such
  835. patterns are now accepted, but if any repetition of the subpattern does in fact
  836. match no characters, the loop is forcibly broken.
  837. </P>
  838. <P>
  839. By default, the quantifiers are "greedy", that is, they match as much as
  840. possible (up to the maximum number of permitted times), without causing the
  841. rest of the pattern to fail. The classic example of where this gives problems
  842. is in trying to match comments in C programs. These appear between /* and */
  843. and within the comment, individual * and / characters may appear. An attempt to
  844. match C comments by applying the pattern
  845. <pre>
  846. /\*.*\*/
  847. </pre>
  848. to the string
  849. <pre>
  850. /* first comment */ not comment /* second comment */
  851. </pre>
  852. fails, because it matches the entire string owing to the greediness of the .*
  853. item.
  854. </P>
  855. <P>
  856. However, if a quantifier is followed by a question mark, it ceases to be
  857. greedy, and instead matches the minimum number of times possible, so the
  858. pattern
  859. <pre>
  860. /\*.*?\*/
  861. </pre>
  862. does the right thing with the C comments. The meaning of the various
  863. quantifiers is not otherwise changed, just the preferred number of matches.
  864. Do not confuse this use of question mark with its use as a quantifier in its
  865. own right. Because it has two uses, it can sometimes appear doubled, as in
  866. <pre>
  867. \d??\d
  868. </pre>
  869. which matches one digit by preference, but can match two if that is the only
  870. way the rest of the pattern matches.
  871. </P>
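<P>
A short sketch of greedy versus minimizing repetition, using Python's
<b>re</b> module on the same subject string as above. Python is assumed to
follow the same rules as PCRE for these two forms.
</P>

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so the whole string is one match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first "*/" each time.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```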
  872. <P>
  873. If the PCRE_UNGREEDY option is set (an option which is not available in Perl),
  874. the quantifiers are not greedy by default, but individual ones can be made
  875. greedy by following them with a question mark. In other words, it inverts the
  876. default behaviour.
  877. </P>
  878. <P>
  879. When a parenthesized subpattern is quantified with a minimum repeat count that
  880. is greater than 1 or with a limited maximum, more memory is required for the
  881. compiled pattern, in proportion to the size of the minimum or maximum.
  882. </P>
  883. <P>
  884. If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
  885. to Perl's /s) is set, thus allowing the . to match newlines, the pattern is
  886. implicitly anchored, because whatever follows will be tried against every
  887. character position in the subject string, so there is no point in retrying the
  888. overall match at any position after the first. PCRE normally treats such a
  889. pattern as though it were preceded by \A.
  890. </P>
  891. <P>
  892. In cases where it is known that the subject string contains no newlines, it is
  893. worth setting PCRE_DOTALL in order to obtain this optimization, or
  894. alternatively using ^ to indicate anchoring explicitly.
  895. </P>
  896. <P>
  897. However, there is one situation where the optimization cannot be used. When .*
  898. is inside capturing parentheses that are the subject of a backreference
  899. elsewhere in the pattern, a match at the start may fail, and a later one
  900. succeed. Consider, for example:
  901. <pre>
  902. (.*)abc\1
  903. </pre>
  904. If the subject is "xyz123abc123" the match point is the fourth character. For
  905. this reason, such a pattern is not implicitly anchored.
  906. </P>
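<P>
The match point in this example can be confirmed with a sketch in Python's
<b>re</b> module, which reports zero-based offsets, so the fourth character
appears as offset 3.
</P>

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because the text after "abc" does not
# repeat what (.*) captured; the attempt at offset 3 succeeds.
start_offset = m.start()
captured = m.group(1)
```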
  907. <P>
  908. When a capturing subpattern is repeated, the value captured is the substring
  909. that matched the final iteration. For example, after
  910. <pre>
  911. (tweedle[dume]{3}\s*)+
  912. </pre>
  913. has matched "tweedledum tweedledee" the value of the captured substring is
  914. "tweedledee". However, if there are nested capturing subpatterns, the
  915. corresponding captured values may have been set in previous iterations. For
  916. example, after
  917. <pre>
  918. /(a|(b))+/
  919. </pre>
  920. matches "aba" the value of the second captured substring is "b".
  921. <a name="atomicgroup"></a></P>
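<P>
Both examples can be reproduced in Python's <b>re</b> module, which is assumed
here to keep captured values from earlier iterations in the same way.
</P>

```python
import re

# The outer group is repeated twice; the value kept is the final iteration.
last = re.fullmatch(r"(tweedle[dume]{3}\s*)+",
                    "tweedledum tweedledee").group(1)

# The inner group matched only in the second iteration, and that value is
# still set after the third iteration matches "a".
nested = re.fullmatch(r"(a|(b))+", "aba").groups()
```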
  922. <br><a name="SEC13" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
  923. <P>
  924. With both maximizing and minimizing repetition, failure of what follows
  925. normally causes the repeated item to be re-evaluated to see if a different
  926. number of repeats allows the rest of the pattern to match. Sometimes it is
  927. useful to prevent this, either to change the nature of the match, or to cause
  928. it to fail earlier than it otherwise might, when the author of the pattern knows
  929. there is no point in carrying on.
  930. </P>
  931. <P>
  932. Consider, for example, the pattern \d+foo when applied to the subject line
  933. <pre>
  934. 123456bar
  935. </pre>
  936. After matching all 6 digits and then failing to match "foo", the normal
  937. action of the matcher is to try again with only 5 digits matching the \d+
  938. item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
  939. (a term taken from Jeffrey Friedl's book) provides the means for specifying
  940. that once a subpattern has matched, it is not to be re-evaluated in this way.
  941. </P>
  942. <P>
  943. If we use atomic grouping for the previous example, the matcher would give up
  944. immediately on failing to match "foo" the first time. The notation is a kind of
  945. special parenthesis, starting with (?&#62; as in this example:
  946. <pre>
  947. (?&#62;\d+)foo
  948. </pre>
  949. This kind of parenthesis "locks up" the part of the pattern it contains once
  950. it has matched, and a failure further into the pattern is prevented from
  951. backtracking into it. Backtracking past it to previous items, however, works as
  952. normal.
  953. </P>
  954. <P>
  955. An alternative description is that a subpattern of this type matches the string
  956. of characters that an identical standalone pattern would match, if anchored at
  957. the current point in the subject string.
  958. </P>
  959. <P>
  960. Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
  961. the above example can be thought of as a maximizing repeat that must swallow
  962. everything it can. So, while both \d+ and \d+? are prepared to adjust the
  963. number of digits they match in order to make the rest of the pattern match,
  964. (?&#62;\d+) can only match an entire sequence of digits.
  965. </P>
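<P>
The effect can be sketched in Python's <b>re</b> module. Because the
(?&#62;...) syntax is accepted only by Python 3.11 and later, this example
substitutes a well-known emulation: a lookahead that captures the digits,
followed by a back reference that consumes them. This is a swapped-in
technique for illustration, not PCRE's own notation. The pattern is a variant
of the one above, with an extra \d so that backtracking changes the outcome.
</P>

```python
import re

subject = "123456bar"

# Ordinary repeat: \d+ gives back one digit so that \d can match "6".
backtracking = re.match(r"\d+\dbar", subject)

# Atomic-like: the lookahead captures all six digits and cannot be
# re-entered, so nothing is left for \d and the match fails.
atomic_like = re.match(r"(?=(\d+))\1\dbar", subject)
```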
  966. <P>
  967. Atomic groups in general can of course contain arbitrarily complicated
  968. subpatterns, and can be nested. However, when the subpattern for an atomic
  969. group is just a single repeated item, as in the example above, a simpler
  970. notation, called a "possessive quantifier" can be used. This consists of an
  971. additional + character following a quantifier. Using this notation, the
  972. previous example can be rewritten as
  973. <pre>
  974. \d++foo
  975. </pre>
  976. Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY
  977. option is ignored. They are a convenient notation for the simpler forms of
  978. atomic group. However, there is no difference in the meaning or processing of a
  979. possessive quantifier and the equivalent atomic group.
  980. </P>
  981. <P>
  982. The possessive quantifier syntax is an extension to the Perl syntax. It
  983. originates in Sun's Java package.
  984. </P>
  985. <P>
  986. When a pattern contains an unlimited repeat inside a subpattern that can itself
  987. be repeated an unlimited number of times, the use of an atomic group is the
  988. only way to avoid some failing matches taking a very long time indeed. The
  989. pattern
  990. <pre>
  991. (\D+|&#60;\d+&#62;)*[!?]
  992. </pre>
  993. matches an unlimited number of substrings that either consist of non-digits, or
  994. digits enclosed in &#60;&#62;, followed by either ! or ?. When it matches, it runs
  995. quickly. However, if it is applied to
  996. <pre>
  997. aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  998. </pre>
  999. it takes a long time before reporting failure. This is because the string can
  1000. be divided between the internal \D+ repeat and the external * repeat in a
  1001. large number of ways, and all have to be tried. (The example uses [!?] rather
  1002. than a single character at the end, because both PCRE and Perl have an
  1003. optimization that allows for fast failure when a single character is used. They
  1004. remember the last single character that is required for a match, and fail early
  1005. if it is not present in the string.) If the pattern is changed so that it uses
  1006. an atomic group, like this:
  1007. <pre>
  1008. ((?&#62;\D+)|&#60;\d+&#62;)*[!?]
  1009. </pre>
  1010. sequences of non-digits cannot be broken, and failure happens quickly.
  1011. <a name="backreferences"></a></P>
  1012. <br><a name="SEC14" href="#TOC1">BACK REFERENCES</a><br>
  1013. <P>
  1014. Outside a character class, a backslash followed by a digit greater than 0 (and
  1015. possibly further digits) is a back reference to a capturing subpattern earlier
  1016. (that is, to its left) in the pattern, provided there have been that many
  1017. previous capturing left parentheses.
  1018. </P>
  1019. <P>
  1020. However, if the decimal number following the backslash is less than 10, it is
  1021. always taken as a back reference, and causes an error only if there are not
  1022. that many capturing left parentheses in the entire pattern. In other words, the
  1023. parentheses that are referenced need not be to the left of the reference for
  1024. numbers less than 10. See the subsection entitled "Non-printing characters"
  1025. <a href="#digitsafterbackslash">above</a>
  1026. for further details of the handling of digits following a backslash.
  1027. </P>
  1028. <P>
  1029. A back reference matches whatever actually matched the capturing subpattern in
  1030. the current subject string, rather than anything matching the subpattern
  1031. itself (see
  1032. <a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
  1033. below for a way of doing that). So the pattern
  1034. <pre>
  1035. (sens|respons)e and \1ibility
  1036. </pre>
  1037. matches "sense and sensibility" and "response and responsibility", but not
  1038. "sense and responsibility". If caseful matching is in force at the time of the
  1039. back reference, the case of letters is relevant. For example,
  1040. <pre>
  1041. ((?i)rah)\s+\1
  1042. </pre>
  1043. matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
  1044. capturing subpattern is matched caselessly.
  1045. </P>
  1046. <P>
  1047. Back references to named subpatterns use the Python syntax (?P=name). We could
  1048. rewrite the above example as follows:
  1049. <pre>
  1050. (?P&#60;p1&#62;(?i)rah)\s+(?P=p1)
  1051. </pre>
  1052. There may be more than one back reference to the same subpattern. If a
  1053. subpattern has not actually been used in a particular match, any back
  1054. references to it always fail. For example, the pattern
  1055. <pre>
  1056. (a|(bc))\2
  1057. </pre>
  1058. always fails if it starts to match "a" rather than "bc". Because there may be
  1059. many capturing parentheses in a pattern, all digits following the backslash are
  1060. taken as part of a potential back reference number. If the pattern continues
  1061. with a digit character, some delimiter must be used to terminate the back
  1062. reference. If the PCRE_EXTENDED option is set, this can be whitespace.
  1063. Otherwise an empty comment (see
  1064. <a href="#comments">"Comments"</a>
  1065. below) can be used.
  1066. </P>
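<P>
A sketch of these back reference rules in Python's <b>re</b> module. The
caseless example is written with the scoped form (?i:rah), because recent
Python versions do not accept (?i) in the middle of a pattern; the rest is
assumed to behave as described above.
</P>

```python
import re

# \1 matches what the group actually matched, not the group's pattern.
novel = re.compile(r"(sens|respons)e and \1ibility")
titles = [t for t in ("sense and sensibility",
                      "response and responsibility",
                      "sense and responsibility") if novel.fullmatch(t)]

# The group is matched caselessly, but the back reference is caseful.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
cheers = [bool(named.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]

# Group 2 is unset when the "a" branch is taken, so \2 cannot match.
unset = re.match(r"(a|(bc))\2", "aa")
used = re.match(r"(a|(bc))\2", "bcbc")
```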
  1067. <P>
  1068. A back reference that occurs inside the parentheses to which it refers fails
  1069. when the subpattern is first used, so, for example, (a\1) never matches.
  1070. However, such references can be useful inside repeated subpatterns. For
  1071. example, the pattern
  1072. <pre>
  1073. (a|b\1)+
  1074. </pre>
  1075. matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
  1076. the subpattern, the back reference matches the character string corresponding
  1077. to the previous iteration. In order for this to work, the pattern must be such
  1078. that the first iteration does not need to match the back reference. This can be
  1079. done using alternation, as in the example above, or by a quantifier with a
  1080. minimum of zero.
  1081. <a name="bigassertions"></a></P>
  1082. <br><a name="SEC15" href="#TOC1">ASSERTIONS</a><br>
  1083. <P>
  1084. An assertion is a test on the characters following or preceding the current
  1085. matching point that does not actually consume any characters. The simple
  1086. assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
  1087. <a href="#smallassertions">above.</a>
  1088. </P>
  1089. <P>
  1090. More complicated assertions are coded as subpatterns. There are two kinds:
  1091. those that look ahead of the current position in the subject string, and those
  1092. that look behind it. An assertion subpattern is matched in the normal way,
  1093. except that it does not cause the current matching position to be changed.
  1094. </P>
  1095. <P>
  1096. Assertion subpatterns are not capturing subpatterns, and may not be repeated,
  1097. because it makes no sense to assert the same thing several times. If any kind
  1098. of assertion contains capturing subpatterns within it, these are counted for
  1099. the purposes of numbering the capturing subpatterns in the whole pattern.
  1100. However, substring capturing is carried out only for positive assertions,
  1101. because it does not make sense for negative assertions.
  1102. </P>
  1103. <br><b>
  1104. Lookahead assertions
  1105. </b><br>
  1106. <P>
  1107. Lookahead assertions start
  1108. with (?= for positive assertions and (?! for negative assertions. For example,
  1109. <pre>
  1110. \w+(?=;)
  1111. </pre>
  1112. matches a word followed by a semicolon, but does not include the semicolon in
  1113. the match, and
  1114. <pre>
  1115. foo(?!bar)
  1116. </pre>
  1117. matches any occurrence of "foo" that is not followed by "bar". Note that the
  1118. apparently similar pattern
  1119. <pre>
  1120. (?!foo)bar
  1121. </pre>
  1122. does not find an occurrence of "bar" that is preceded by something other than
  1123. "foo"; it finds any occurrence of "bar" whatsoever, because the assertion
  1124. (?!foo) is always true when the next three characters are "bar". A
  1125. lookbehind assertion is needed to achieve the other effect.
  1126. </P>
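<P>
The three lookahead patterns above, tried in Python's <b>re</b> module, which
is assumed to treat these assertions as PCRE does. The subject strings are
made up for the example.
</P>

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "count;").group()

# Only the "foo" that is not followed by "bar" is found (offset 7).
foo_offsets = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested at the point where "bar" begins, where it always
# holds, so "bar" is found even though "foo" precedes it.
bar_hit = re.search(r"(?!foo)bar", "foobar")
```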
  1127. <P>
  1128. If you want to force a matching failure at some point in a pattern, the most
  1129. convenient way to do it is with (?!) because an empty string always matches, so
  1130. an assertion that requires there not to be an empty string must always fail.
  1131. <a name="lookbehind"></a></P>
  1132. <br><b>
  1133. Lookbehind assertions
  1134. </b><br>
  1135. <P>
  1136. Lookbehind assertions start with (?&#60;= for positive assertions and (?&#60;! for
  1137. negative assertions. For example,
  1138. <pre>
  1139. (?&#60;!foo)bar
  1140. </pre>
  1141. does find an occurrence of "bar" that is not preceded by "foo". The contents of
  1142. a