/Doc/library/re.rst

http://unladen-swallow.googlecode.com/
Possible License(s): 0BSD, BSD-3-Clause
  1. :mod:`re` --- Regular expression operations
  2. ===========================================
  3. .. module:: re
  4. :synopsis: Regular expression operations.
  5. .. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
  6. .. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
  7. This module provides regular expression matching operations similar to
  8. those found in Perl. Both patterns and strings to be searched can be
  9. Unicode strings as well as 8-bit strings.
  10. Regular expressions use the backslash character (``'\'``) to indicate
  11. special forms or to allow special characters to be used without invoking
  12. their special meaning. This collides with Python's usage of the same
  13. character for the same purpose in string literals; for example, to match
  14. a literal backslash, one might have to write ``'\\\\'`` as the pattern
  15. string, because the regular expression must be ``\\``, and each
  16. backslash must be expressed as ``\\`` inside a regular Python string
  17. literal.
  18. The solution is to use Python's raw string notation for regular expression
  19. patterns; backslashes are not handled in any special way in a string literal
  20. prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
  21. ``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
  22. newline. Usually patterns will be expressed in Python code using this raw
  23. string notation.
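
As a small illustration of the difference between raw and regular string literals, the sketch below compares the two forms. The variable names are chosen for this example only.

```python
import re

# A raw string keeps the backslash: two characters, '\' and 'n'.
raw_len = len(r"\n")      # 2
# A regular string literal turns the escape into a single newline.
plain_len = len("\n")     # 1

# Both patterns below match one literal backslash; the raw form
# needs half as many backslashes in the source code.
text = "C:\\temp"                    # the 7-character string C:\temp
m_plain = re.search("\\\\", text)
m_raw = re.search(r"\\", text)
backslash_index = m_raw.start()      # 2
```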
  24. It is important to note that most regular expression operations are available as
  25. module-level functions and :class:`RegexObject` methods. The functions are
  26. shortcuts that don't require you to compile a regex object first, but miss some
  27. fine-tuning parameters.
  28. .. seealso::
  29. Mastering Regular Expressions
  30. Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
  31. second edition of the book no longer covers Python at all, but the first
  32. edition covered writing good regular expression patterns in great detail.
  33. .. _re-syntax:
  34. Regular Expression Syntax
  35. -------------------------
  36. A regular expression (or RE) specifies a set of strings that matches it; the
  37. functions in this module let you check if a particular string matches a given
  38. regular expression (or if a given regular expression matches a particular
  39. string, which comes down to the same thing).
  40. Regular expressions can be concatenated to form new regular expressions; if *A*
  41. and *B* are both regular expressions, then *AB* is also a regular expression.
  42. In general, if a string *p* matches *A* and another string *q* matches *B*, the
  43. string *pq* will match *AB*. This holds unless *A* or *B* contains low-precedence
  44. operations, there are boundary conditions between *A* and *B*, or either has
  45. numbered group references. Thus, complex expressions can easily be constructed from simpler
  46. primitive expressions like the ones described here. For details of the theory
  47. and implementation of regular expressions, consult the Friedl book referenced
  48. above, or almost any textbook about compiler construction.
  49. A brief explanation of the format of regular expressions follows. For further
  50. information and a gentler presentation, consult the :ref:`regex-howto`.
  51. Regular expressions can contain both special and ordinary characters. Most
  52. ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
  53. expressions; they simply match themselves. You can concatenate ordinary
  54. characters, so ``last`` matches the string ``'last'``. (In the rest of this
  55. section, we'll write REs in ``this special style``, usually without quotes, and
  56. strings to be matched ``'in single quotes'``.)
  57. Some characters, like ``'|'`` or ``'('``, are special. Special
  58. characters either stand for classes of ordinary characters, or affect
  59. how the regular expressions around them are interpreted. Regular
  60. expression pattern strings may not contain null bytes, but can specify
  61. the null byte using the ``\number`` notation, e.g., ``'\x00'``.
  62. The special characters are:
  63. ``'.'``
  64. (Dot.) In the default mode, this matches any character except a newline. If
  65. the :const:`DOTALL` flag has been specified, this matches any character
  66. including a newline.
  67. ``'^'``
  68. (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
  69. matches immediately after each newline.
  70. ``'$'``
  71. Matches the end of the string or just before the newline at the end of the
  72. string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
  73. matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
  74. only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
  75. matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
  76. a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
  77. the newline, and one at the end of the string.
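
The ``'$'`` behaviour described above can be checked with a short sketch like the following. The variable names are illustrative.

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' matches only at the end, or just before a final newline.
default_hit = re.search(r'foo.$', text).group()            # 'foo2'
# MULTILINE mode: '$' also matches before every newline.
multiline_hit = re.search(r'foo.$', text, re.M).group()    # 'foo1'

# A lone '$' finds two empty matches in 'foo\n': one just before the
# newline and one at the very end of the string.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]  # [3, 4]
```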
  78. ``'*'``
  79. Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
  80. many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
  81. by any number of 'b's.
  82. ``'+'``
  83. Causes the resulting RE to match 1 or more repetitions of the preceding RE.
  84. ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
  85. match just 'a'.
  86. ``'?'``
  87. Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
  88. ``ab?`` will match either 'a' or 'ab'.
  89. ``*?``, ``+?``, ``??``
  90. The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
  91. as much text as possible. Sometimes this behaviour isn't desired; if the RE
  92. ``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
  93. string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
  94. perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
  95. characters as possible will be matched. Using ``.*?`` in the previous
  96. expression will match only ``'<H1>'``.
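
A minimal sketch of the greedy and non-greedy forms, using the same input as the text above:

```python
import re

html = '<H1>title</H1>'

# Greedy: '.*' grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r'<.*>', html).group()    # '<H1>title</H1>'
# Non-greedy: '.*?' stops at the first '>' that lets the pattern succeed.
lazy = re.match(r'<.*?>', html).group()     # '<H1>'
```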
  97. ``{m}``
  98. Specifies that exactly *m* copies of the previous RE should be matched; fewer
  99. matches cause the entire RE not to match. For example, ``a{6}`` will match
  100. exactly six ``'a'`` characters, but not five.
  101. ``{m,n}``
  102. Causes the resulting RE to match from *m* to *n* repetitions of the preceding
  103. RE, attempting to match as many repetitions as possible. For example,
  104. ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
  105. lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
  106. example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
  107. followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
  108. modifier would be confused with the previously described form.
  109. ``{m,n}?``
  110. Causes the resulting RE to match from *m* to *n* repetitions of the preceding
  111. RE, attempting to match as *few* repetitions as possible. This is the
  112. non-greedy version of the previous qualifier. For example, on the
  113. 6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
  114. while ``a{3,5}?`` will only match 3 characters.
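
The counted qualifiers can be tried out as follows. This is only a sketch, and the variable names are not part of the module.

```python
import re

six = 'aaaaaa'

greedy_run = re.match(r'a{3,5}', six).group()    # 'aaaaa'
lazy_run = re.match(r'a{3,5}?', six).group()     # 'aaa'

# An open upper bound: at least four 'a' characters, then 'b'.
too_short = re.match(r'a{4,}b', 'aaab')               # None
long_enough = re.match(r'a{4,}b', 'a' * 1000 + 'b')   # a match object
```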
  115. ``'\'``
  116. Either escapes special characters (permitting you to match characters like
  117. ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
  118. sequences are discussed below.
  119. If you're not using a raw string to express the pattern, remember that Python
  120. also uses the backslash as an escape sequence in string literals; if the escape
  121. sequence isn't recognized by Python's parser, the backslash and subsequent
  122. character are included in the resulting string. However, if Python would
  123. recognize the resulting sequence, the backslash should be doubled. This
  124. is complicated and hard to understand, so it's highly recommended that you use
  125. raw strings for all but the simplest expressions.
  126. ``[]``
  127. Used to indicate a set of characters. Characters can be listed individually, or
  128. a range of characters can be indicated by giving two characters and separating
  129. them by a ``'-'``. Special characters are not active inside sets. For example,
  130. ``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
  131. ``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
  132. ``[a-zA-Z0-9]`` matches any letter or digit. Character classes such
  133. as ``\w`` or ``\S`` (defined below) are also acceptable inside a
  134. set, although the characters they match depend on whether :const:`LOCALE`
  135. or :const:`UNICODE` mode is in force. If you want to include a
  136. ``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
  137. place it as the first character. The pattern ``[]]`` will match
  138. ``']'``, for example.
  139. You can match the characters not within a range by :dfn:`complementing` the set.
  140. This is indicated by including a ``'^'`` as the first character of the set;
  141. ``'^'`` elsewhere will simply match the ``'^'`` character. For example,
  142. ``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
  143. character except ``'^'``.
  144. Note that inside ``[]`` the special forms and special characters lose
  145. their meanings and only the syntaxes described here are valid. For
  146. example, ``+``, ``*``, ``(``, ``)``, and so on are treated as
  147. literals inside ``[]``, and backreferences cannot be used inside
  148. ``[]``.
  149. ``'|'``
  150. ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
  151. will match either A or B. An arbitrary number of REs can be separated by the
  152. ``'|'`` in this way. This can be used inside groups (see below) as well. As
  153. the target string is scanned, REs separated by ``'|'`` are tried from left to
  154. right. When one pattern completely matches, that branch is accepted. This means
  155. that once ``A`` matches, ``B`` will not be tested further, even if it would
  156. produce a longer overall match. In other words, the ``'|'`` operator is never
  157. greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
  158. character class, as in ``[|]``.
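
The left-to-right behaviour of ``'|'`` can be seen in a sketch like this one, where the order of the alternatives decides which branch wins:

```python
import re

# 'a' is tried first and succeeds, so 'ab' is never attempted.
first_branch = re.match(r'a|ab', 'ab').group()    # 'a'
# Putting the longer alternative first yields the longer match.
longer_first = re.match(r'ab|a', 'ab').group()    # 'ab'

# Two ways to match a literal '|'.
escaped_pipe = re.search(r'\|', 'x|y').start()    # 1
class_pipe = re.search(r'[|]', 'x|y').start()     # 1
```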
  159. ``(...)``
  160. Matches whatever regular expression is inside the parentheses, and indicates the
  161. start and end of a group; the contents of a group can be retrieved after a match
  162. has been performed, and can be matched later in the string with the ``\number``
  163. special sequence, described below. To match the literals ``'('`` or ``')'``,
  164. use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
  165. ``(?...)``
  166. This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
  167. otherwise). The first character after the ``'?'`` determines the meaning
  168. and further syntax of the construct. Extensions usually do not create a new
  169. group; ``(?P<name>...)`` is the only exception to this rule. Following are the
  170. currently supported extensions.
  171. ``(?iLmsux)``
  172. (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
  173. ``'u'``, ``'x'``.) The group matches the empty string; the letters
  174. set the corresponding flags: :const:`re.I` (ignore case),
  175. :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
  176. :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
  177. and :const:`re.X` (verbose), for the entire regular expression. (The
  178. flags are described in :ref:`contents-of-module-re`.) This
  179. is useful if you wish to include the flags as part of the regular
  180. expression, instead of passing a *flag* argument to the
  181. :func:`compile` function.
  182. Note that the ``(?x)`` flag changes how the expression is parsed. It should be
  183. used first in the expression string, or after one or more whitespace characters.
  184. If there are non-whitespace characters before the flag, the results are
  185. undefined.
  186. ``(?:...)``
  187. A non-capturing version of regular parentheses. Matches whatever regular
  188. expression is inside the parentheses, but the substring matched by the group
  189. *cannot* be retrieved after performing a match or referenced later in the
  190. pattern.
  191. ``(?P<name>...)``
  192. Similar to regular parentheses, but the substring matched by the group is
  193. accessible within the rest of the regular expression via the symbolic group
  194. name *name*. Group names must be valid Python identifiers, and each group
  195. name must be defined only once within a regular expression. A symbolic group
  196. is also a numbered group, just as if the group were not named. So the group
  197. named ``id`` in the example below can also be referenced as the numbered group
  198. ``1``.
  199. For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
  200. referenced by its name in arguments to methods of match objects, such as
  201. ``m.group('id')`` or ``m.end('id')``, and also by name in the regular
  202. expression itself (using ``(?P=id)``) and replacement text given to
  203. ``.sub()`` (using ``\g<id>``).
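
The three ways of referring to a named group can be sketched as follows. The sample strings are made up for this illustration.

```python
import re

m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
by_name = m.group('id')      # 'spam_1'
by_number = m.group(1)       # the same group, referenced by number
end_pos = m.end('id')        # 6

# (?P=id) inside the pattern matches the same text again.
doubled = re.search(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'say hello hello').group()

# \g<id> in replacement text inserts the text of the named group.
wrapped = re.sub(r'(?P<id>[a-zA-Z_]\w*)', r'<\g<id>>', 'a b')   # '<a> <b>'
```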
  204. ``(?P=name)``
  205. Matches whatever text was matched by the earlier group named *name*.
  206. ``(?#...)``
  207. A comment; the contents of the parentheses are simply ignored.
  208. ``(?=...)``
  209. Matches if ``...`` matches next, but doesn't consume any of the string. This is
  210. called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
  211. ``'Isaac '`` only if it's followed by ``'Asimov'``.
  212. ``(?!...)``
  213. Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
  214. For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
  215. followed by ``'Asimov'``.
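
A short sketch of both lookahead assertions. Note that the text checked by the assertion is not part of the match.

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not consumed.
followed = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
followed_text = followed.group()     # 'Isaac '
followed_end = followed.end()        # 6

# The same pattern fails when something else follows.
wrong_name = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')     # None

# Negative lookahead: succeeds only when 'Asimov' does not follow.
not_asimov = re.search(r'Isaac (?!Asimov)', 'Isaac Newton').group()   # 'Isaac '
```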
  216. ``(?<=...)``
  217. Matches if the current position in the string is preceded by a match for ``...``
  218. that ends at the current position. This is called a :dfn:`positive lookbehind
  219. assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
  220. lookbehind will back up 3 characters and check if the contained pattern matches.
  221. The contained pattern must only match strings of some fixed length, meaning that
  222. ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
  223. patterns which start with positive lookbehind assertions will never match at the
  224. beginning of the string being searched; you will most likely want to use the
  225. :func:`search` function rather than the :func:`match` function:
  226. >>> import re
  227. >>> m = re.search('(?<=abc)def', 'abcdef')
  228. >>> m.group(0)
  229. 'def'
  230. This example looks for a word following a hyphen:
  231. >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
  232. >>> m.group(0)
  233. 'egg'
  234. ``(?<!...)``
  235. Matches if the current position in the string is not preceded by a match for
  236. ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
  237. positive lookbehind assertions, the contained pattern must only match strings of
  238. some fixed length. Patterns which start with negative lookbehind assertions may
  239. match at the beginning of the string being searched.
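
As a sketch of a negative lookbehind, the pattern below collects words that are not immediately preceded by a hyphen. The input string is invented for this example.

```python
import re

# 'egg' is skipped because a '-' comes right before it; 'spam' is found
# even though it sits at the very beginning of the string.
words = re.findall(r'(?<!-)\b\w+', 'spam-egg ham')    # ['spam', 'ham']
```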
  240. ``(?(id/name)yes-pattern|no-pattern)``
  241. Will try to match with ``yes-pattern`` if the group with given *id* or *name*
  242. exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
  243. can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
  244. matching pattern, which will match with ``'<user@host.com>'`` as well as
  245. ``'user@host.com'``, but not with ``'<user@host.com'``.
  246. .. versionadded:: 2.4
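
The conditional pattern above can be exercised as in this sketch. It uses :func:`match` so that the attempt is anchored at the start of the string; the helper name ``accepts`` is hypothetical.

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

def accepts(text):
    """Return True if the pattern matches at the start of *text*."""
    return email.match(text) is not None

bracketed = accepts('<user@host.com>')     # True: '<' seen, so '>' is required
bare = accepts('user@host.com')            # True: no '<', so no '>' is needed
unbalanced = accepts('<user@host.com')     # False: the closing '>' is missing
```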
  247. The special sequences consist of ``'\'`` and a character from the list below.
  248. If the ordinary character is not on the list, then the resulting RE will match
  249. the second character. For example, ``\$`` matches the character ``'$'``.
  250. ``\number``
  251. Matches the contents of the group of the same number. Groups are numbered
  252. starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
  253. but not ``'the end'`` (note the space after the group). This special sequence
  254. can only be used to match one of the first 99 groups. If the first digit of
  255. *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
  256. a group match, but as the character with octal value *number*. Inside the
  257. ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
  258. characters.
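
A brief sketch of a numbered backreference, tested with :func:`match` so that the group must start at the beginning of the string:

```python
import re

repeated = re.compile(r'(.+) \1')

same_word = repeated.match('the the')     # matches; group 1 is 'the'
same_digits = repeated.match('55 55')     # matches; group 1 is '55'
different = repeated.match('the end')     # None
```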
  259. ``\A``
  260. Matches only at the start of the string.
  261. ``\b``
  262. Matches the empty string, but only at the beginning or end of a word. A word is
  263. defined as a sequence of alphanumeric or underscore characters, so the end of a
  264. word is indicated by whitespace or a non-alphanumeric, non-underscore character.
  265. Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
  266. precise set of characters deemed to be alphanumeric depends on the values of the
  267. ``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
  268. the backspace character, for compatibility with Python's string literals.
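
The two meanings of ``\b`` can be compared in a sketch like the following. The sample text is made up.

```python
import re

# Outside a character class, \b is a zero-width word boundary.
whole_words = re.findall(r'\bfoo\b', 'foo foobar (foo) barfoo')   # ['foo', 'foo']

# Inside a character class, \b stands for the backspace character.
backspace_index = re.search(r'[\b]', 'a\bc').start()              # 1
```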
  269. ``\B``
  270. Matches the empty string, but only when it is *not* at the beginning or end of a
  271. word. This is just the opposite of ``\b``, so is also subject to the settings
  272. of ``LOCALE`` and ``UNICODE``.
  273. ``\d``
  274. When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
  275. is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
  276. whatever is classified as a digit in the Unicode character properties database.
  277. ``\D``
  278. When the :const:`UNICODE` flag is not specified, matches any non-digit
  279. character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
  280. will match anything other than characters marked as digits in the Unicode
  281. character properties database.
  282. ``\s``
  283. When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
  284. any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
  285. :const:`LOCALE`, it will match this set plus whatever characters are defined as
  286. space for the current locale. If :const:`UNICODE` is set, this will match the
  287. characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
  288. character properties database.
  289. ``\S``
  290. When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
  291. any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
  292. With :const:`LOCALE`, it will match any character not in this set, and not
  293. defined as space in the current locale. If :const:`UNICODE` is set, this will
  294. match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
  295. the Unicode character properties database.
  296. ``\w``
  297. When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
  298. any alphanumeric character and the underscore; this is equivalent to the set
  299. ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
  300. whatever characters are defined as alphanumeric for the current locale. If
  301. :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
  302. is classified as alphanumeric in the Unicode character properties database.
  303. ``\W``
  304. When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
  305. any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
  306. With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
  307. not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
  308. this will match anything other than ``[0-9_]`` and characters marked as
  309. alphanumeric in the Unicode character properties database.
  310. ``\Z``
  311. Matches only at the end of the string.
  312. Most of the standard escapes supported by Python string literals are also
  313. accepted by the regular expression parser::
  314. \a \b \f \n
  315. \r \t \v \x
  316. \\
  317. Octal escapes are included in a limited form: If the first digit is a 0, or if
  318. there are three octal digits, it is considered an octal escape. Otherwise, it is
  319. a group reference. As for string literals, octal escapes are always at most
  320. three digits in length.
  321. .. _matching-searching:
  322. Matching vs Searching
  323. ---------------------
  324. .. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
  325. Python offers two different primitive operations based on regular expressions:
  326. **match** checks for a match only at the beginning of the string, while
  327. **search** checks for a match anywhere in the string (this is what Perl does
  328. by default).
  329. Note that match may differ from search even when using a regular expression
  330. beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
  331. :const:`MULTILINE` mode also immediately following a newline. The "match"
  332. operation succeeds only if the pattern matches at the start of the string
  333. regardless of mode, or at the starting position given by the optional *pos*
  334. argument regardless of whether a newline precedes it.
  335. >>> re.match("c", "abcdef") # No match
  336. >>> re.search("c", "abcdef") # Match
  337. <_sre.SRE_Match object at ...>
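
The difference also shows up with ``'^'`` in :const:`MULTILINE` mode, as in this sketch:

```python
import re

text = 'A\nX'

# match() only ever tries the start of the string, so the 'X' on the
# second line is not found even though '^' could match there.
anchored = re.match(r'^X', text, re.M)               # None
# search() scans forward and finds 'X' right after the newline.
scanned_at = re.search(r'^X', text, re.M).start()    # 2
```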
  338. .. _contents-of-module-re:
  339. Module Contents
  340. ---------------
  341. The module defines several functions, constants, and an exception. Some of the
  342. functions are simplified versions of the full featured methods for compiled
  343. regular expressions. Most non-trivial applications use the compiled
  344. form.
  345. .. function:: compile(pattern[, flags])
  346. Compile a regular expression pattern into a regular expression object, which
  347. can be used for matching using its :func:`match` and :func:`search` methods,
  348. described below.
  349. The expression's behaviour can be modified by specifying a *flags* value.
  350. Values can be any of the following variables, combined using bitwise OR (the
  351. ``|`` operator).
  352. The sequence ::
  353. prog = re.compile(pattern)
  354. result = prog.match(string)
  355. is equivalent to ::
  356. result = re.match(pattern, string)
  357. but using :func:`compile` and saving the resulting regular expression object
  358. for reuse is more efficient when the expression will be used several times
  359. in a single program.
  360. .. note::
  361. The compiled versions of the most recent patterns passed to
  362. :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
  363. programs that use only a few regular expressions at a time needn't worry
  364. about compiling regular expressions.
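
A minimal sketch of compiling once and reusing the result, with two flags combined by bitwise OR. The pattern and sample data are illustrative.

```python
import re

# Compile once, with IGNORECASE and MULTILINE combined using '|'.
prog = re.compile(r'^spam$', re.I | re.M)

# The compiled object can then be reused for several operations.
hits = prog.findall('Spam\nEGGS\nSPAM')      # ['Spam', 'SPAM']
first = prog.match('spam')                   # a match object

# The module-level shortcut gives the same result as the compiled form.
shortcut = re.match(r'^spam$', 'spam', re.I | re.M)
same_text = first.group() == shortcut.group()    # True
```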
  365. .. data:: I
  366. IGNORECASE
  367. Perform case-insensitive matching; expressions like ``[A-Z]`` will match
  368. lowercase letters, too. This is not affected by the current locale.
  369. .. data:: L
  370. LOCALE
  371. Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
  372. current locale.
  373. .. data:: M
  374. MULTILINE
  375. When specified, the pattern character ``'^'`` matches at the beginning of the
  376. string and at the beginning of each line (immediately following each newline);
  377. and the pattern character ``'$'`` matches at the end of the string and at the
  378. end of each line (immediately preceding each newline). By default, ``'^'``
  379. matches only at the beginning of the string, and ``'$'`` only at the end of the
  380. string and immediately before the newline (if any) at the end of the string.
  381. .. data:: S
  382. DOTALL
  383. Make the ``'.'`` special character match any character at all, including a
  384. newline; without this flag, ``'.'`` will match anything *except* a newline.
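
The effect of :const:`DOTALL` can be seen in a two-line sketch:

```python
import re

without_flag = re.match(r'a.b', 'a\nb')          # None: '.' stops at newline
with_flag = re.match(r'a.b', 'a\nb', re.S)       # matches: '.' accepts '\n'
```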
  385. .. data:: U
  386. UNICODE
  387. Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
  388. on the Unicode character properties database.
  389. .. versionadded:: 2.0
  390. .. data:: X
  391. VERBOSE
  392. This flag allows you to write regular expressions that look nicer. Whitespace
  393. within the pattern is ignored, except when in a character class or preceded by
  394. an unescaped backslash, and, when a line contains a ``'#'`` neither in a
  395. character class or preceded by an unescaped backslash, all characters from the
  396. leftmost such ``'#'`` through the end of the line are ignored.
  397. That means that the two following regular expression objects that match a
  398. decimal number are functionally equal::
  399. a = re.compile(r"""\d + # the integral part
  400. \. # the decimal point
  401. \d * # some fractional digits""", re.X)
  402. b = re.compile(r"\d+\.\d*")
  403. .. function:: search(pattern, string[, flags])
  404. Scan through *string* looking for a location where the regular expression
  405. *pattern* produces a match, and return a corresponding :class:`MatchObject`
  406. instance. Return ``None`` if no position in the string matches the pattern; note
  407. that this is different from finding a zero-length match at some point in the
  408. string.
  409. .. function:: match(pattern, string[, flags])
  410. If zero or more characters at the beginning of *string* match the regular
  411. expression *pattern*, return a corresponding :class:`MatchObject` instance.
  412. Return ``None`` if the string does not match the pattern; note that this is
  413. different from a zero-length match.
  414. .. note::
  415. If you want to locate a match anywhere in *string*, use :meth:`search`
  416. instead.
  417. .. function:: split(pattern, string[, maxsplit=0])
  418. Split *string* by the occurrences of *pattern*. If capturing parentheses are
  419. used in *pattern*, then the text of all groups in the pattern are also returned
  420. as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
  421. splits occur, and the remainder of the string is returned as the final element
  422. of the list. (Incompatibility note: in the original Python 1.5 release,
  423. *maxsplit* was ignored. This has been fixed in later releases.)
  424. >>> re.split(r'\W+', 'Words, words, words.')
  425. ['Words', 'words', 'words', '']
  426. >>> re.split(r'(\W+)', 'Words, words, words.')
  427. ['Words', ', ', 'words', ', ', 'words', '.', '']
  428. >>> re.split(r'\W+', 'Words, words, words.', 1)
  429. ['Words', 'words, words.']
  430. If there are capturing groups in the separator and it matches at the start of
  431. the string, the result will start with an empty string. The same holds for
  432. the end of the string:
  433. >>> re.split(r'(\W+)', '...words, words...')
  434. ['', '...', 'words', ', ', 'words', '...', '']
  435. That way, separator components are always found at the same relative
  436. indices within the result list (e.g., if there's one capturing group
  437. in the separator, the 0th, the 2nd and so forth).
  438. Note that *split* will never split a string on an empty pattern match.
  439. For example:
  440. >>> re.split('x*', 'foo')
  441. ['foo']
  442. >>> re.split("(?m)^$", "foo\n\nbar\n")
  443. ['foo\n\nbar\n']
  444. .. function:: findall(pattern, string[, flags])
  445. Return all non-overlapping matches of *pattern* in *string*, as a list of
  446. strings. The *string* is scanned left-to-right, and matches are returned in
  447. the order found. If one or more groups are present in the pattern, return a
  448. list of groups; this will be a list of tuples if the pattern has more than
  449. one group. Empty matches are included in the result unless they touch the
  450. beginning of another match.
  451. .. versionadded:: 1.5.2
  452. .. versionchanged:: 2.4
  453. Added the optional flags argument.
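
How the shape of the result depends on the number of groups can be sketched as follows. The input string is invented for this example.

```python
import re

text = 'a=1, b=22'

# No groups: a list of the matched strings.
numbers = re.findall(r'\d+', text)             # ['1', '22']
# One group: a list of that group's contents.
names = re.findall(r'(\w+)=', text)            # ['a', 'b']
# Two or more groups: a list of tuples, one tuple per match.
pairs = re.findall(r'(\w+)=(\d+)', text)       # [('a', '1'), ('b', '22')]
```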
  454. .. function:: finditer(pattern, string[, flags])
  455. Return an :term:`iterator` yielding :class:`MatchObject` instances over all
  456. non-overlapping matches for the RE *pattern* in *string*. The *string* is
  457. scanned left-to-right, and matches are returned in the order found. Empty
  458. matches are included in the result unless they touch the beginning of another
  459. match.
  460. .. versionadded:: 2.2
  461. .. versionchanged:: 2.4
  462. Added the optional flags argument.
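
Because :func:`finditer` yields match objects rather than strings, positions are available as well as text. A small sketch:

```python
import re

text = 'a=1, b=22'

# Collect (start, end, text) for every run of digits.
spans = [(m.start(), m.end(), m.group())
         for m in re.finditer(r'\d+', text)]
# spans == [(2, 3, '1'), (7, 9, '22')]
```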
  463. .. function:: sub(pattern, repl, string[, count])
  464. Return the string obtained by replacing the leftmost non-overlapping occurrences
  465. of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
  466. *string* is returned unchanged. *repl* can be a string or a function; if it is
  467. a string, any backslash escapes in it are processed. That is, ``\n`` is
  468. converted to a single newline character, ``\r`` is converted to a carriage return, and
  469. so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
  470. as ``\6``, are replaced with the substring matched by group 6 in the pattern.
  471. For example:
  472. >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
  473. ... r'static PyObject*\npy_\1(void)\n{',
  474. ... 'def myfunc():')
  475. 'static PyObject*\npy_myfunc(void)\n{'
  476. If *repl* is a function, it is called for every non-overlapping occurrence of
  477. *pattern*. The function takes a single match object argument, and returns the
  478. replacement string. For example:
  479. >>> def dashrepl(matchobj):
  480. ... if matchobj.group(0) == '-': return ' '
  481. ... else: return '-'
  482. >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
  483. 'pro--gram files'
  484. The pattern may be a string or an RE object; if you need to specify regular
  485. expression flags, you must use an RE object, or use embedded modifiers in a
  486. pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
  487. The optional argument *count* is the maximum number of pattern occurrences to be
  488. replaced; *count* must be a non-negative integer. If omitted or zero, all
  489. occurrences will be replaced. Empty matches for the pattern are replaced only
  490. when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
  491. ``'-a-b-c-'``.
  492. In addition to character escapes and backreferences as described above,
  493. ``\g<name>`` will use the substring matched by the group named ``name``, as
  494. defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
  495. group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
  496. in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
  497. reference to group 20, not a reference to group 2 followed by the literal
  498. character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
  499. substring matched by the RE.
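
The ``\g<...>`` forms can be tried with a sketch like this one. The patterns and inputs are illustrative only.

```python
import re

# \g<2>0 is group 2 followed by a literal '0', with no ambiguity.
group_then_zero = re.sub(r'(a)(b)', r'\g<2>0', 'ab')               # 'b0'

# \g<name> refers to a group defined with (?P<name>...).
bracketed = re.sub(r'(?P<word>\w+)', r'[\g<word>]', 'hi there')    # '[hi] [there]'

# \g<0> stands for the entire matched substring.
marked = re.sub(r'\d+', r'<\g<0>>', 'a1b22')                       # 'a<1>b<22>'
```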
  500. .. function:: subn(pattern, repl, string[, count])
  501. Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
  502. number_of_subs_made)``.
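
For example, the returned tuple can be unpacked directly, as in this sketch:

```python
import re

new_string, number_of_subs_made = re.subn(r'-', ' ', 'pro-gram-files')
# new_string == 'pro gram files', number_of_subs_made == 2
```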
  503. .. function:: escape(string)
  504. Return *string* with all non-alphanumerics backslashed; this is useful if you
  505. want to match an arbitrary literal string that may have regular expression
  506. metacharacters in it.
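As a rough illustration, assume a literal string that happens to be full of metacharacters (``literal`` and ``text`` are made-up names). The exact escaped form can differ between Python versions, so the sketch only relies on the escaped pattern matching the literal text:

```python
import re

literal = "1.5*(x+y)?"              # hypothetical text to find verbatim
text = "cost = 1.5*(x+y)? maybe"

# Escaped: the metacharacters lose their special meaning.
escaped_match = re.search(re.escape(literal), text)

# Unescaped: the same string is interpreted as a regular expression
# and matches something quite different.
raw_match = re.search(literal, text)

print(escaped_match.group())  # 1.5*(x+y)?
print(raw_match.group())      # 1.5
```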
  507. .. exception:: error
  508. Exception raised when a string passed to one of the functions here is not a
  509. valid regular expression (for example, it might contain unmatched parentheses)
  510. or when some other error occurs during compilation or matching. It is never an
  511. error if a string contains no match for a pattern.
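A minimal sketch of catching the exception; ``is_valid_pattern`` is a hypothetical helper, not part of the module:

```python
import re

def is_valid_pattern(source):
    """Return True if *source* compiles as a regular expression."""
    try:
        re.compile(source)
    except re.error:
        return False
    return True

print(is_valid_pattern(r"(\d+)"))     # True
print(is_valid_pattern("(unclosed"))  # False: unmatched parenthesis

# A pattern that simply fails to match is not an error; the result is None.
print(re.search("x", "abc"))          # None
```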
  512. .. _re-objects:
  513. Regular Expression Objects
  514. --------------------------
  515. Compiled regular expression objects support the following methods and
  516. attributes:
  517. .. method:: RegexObject.match(string[, pos[, endpos]])
  518. If zero or more characters at the beginning of *string* match this regular
  519. expression, return a corresponding :class:`MatchObject` instance. Return
  520. ``None`` if the string does not match the pattern; note that this is different
  521. from a zero-length match.
  522. .. note::
  523. If you want to locate a match anywhere in *string*, use :meth:`search`
  524. instead.
  525. The optional second parameter *pos* gives an index in the string where the
  526. search is to start; it defaults to ``0``. This is not completely equivalent to
  527. slicing the string; the ``'^'`` pattern character matches at the real beginning
  528. of the string and at positions just after a newline, but not necessarily at the
  529. index where the search is to start.
  530. The optional parameter *endpos* limits how far the string will be searched; it
  531. will be as if the string is *endpos* characters long, so only the characters
  532. from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
  533. than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
  534. expression object, ``rx.match(string, 0, 50)`` is equivalent to
  535. ``rx.match(string[:50], 0)``.
  536. >>> pattern = re.compile("o")
  537. >>> pattern.match("dog") # No match as "o" is not at the start of "dog".
  538. >>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
  539. <_sre.SRE_Match object at ...>
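The difference between passing *pos* and slicing the string can be sketched with a pattern that starts with ``'^'``; the strings are made up for illustration:

```python
import re

anchored = re.compile("^b")

# With pos=1 the "^" still refers to the real beginning of "ab", so no match.
with_pos = anchored.match("ab", 1)

# Slicing creates a new string whose beginning is "b", so this matches.
with_slice = anchored.match("ab"[1:])

# endpos behaves as if the string were only endpos characters long.
digits = re.compile(r"\d+")
limited = digits.match("123456", 0, 3)

print(with_pos)            # None
print(with_slice.group())  # b
print(limited.group())     # 123
```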
  540. .. method:: RegexObject.search(string[, pos[, endpos]])
  541. Scan through *string* looking for a location where this regular expression
  542. produces a match, and return a corresponding :class:`MatchObject` instance.
  543. Return ``None`` if no position in the string matches the pattern; note that this
  544. is different from finding a zero-length match at some point in the string.
  545. The optional *pos* and *endpos* parameters have the same meaning as for the
  546. :meth:`match` method.
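One possible use of *pos* is to resume a search where the previous match ended. The loop below is only a sketch (the :func:`finditer` function does the same job more conveniently), and it assumes the pattern never matches an empty string:

```python
import re

rx = re.compile("o")
text = "foo boo"

positions = []
pos = 0
while True:
    m = rx.search(text, pos)
    if m is None:
        break
    positions.append(m.start())
    pos = m.end()  # continue just after the previous match

print(positions)  # [1, 2, 5, 6]
```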
  547. .. method:: RegexObject.split(string[, maxsplit=0])
  548. Identical to the :func:`split` function, using the compiled pattern.
  549. .. method:: RegexObject.findall(string[, pos[, endpos]])
  550. Identical to the :func:`findall` function, using the compiled pattern.
  551. .. method:: RegexObject.finditer(string[, pos[, endpos]])
  552. Identical to the :func:`finditer` function, using the compiled pattern.
  553. .. method:: RegexObject.sub(repl, string[, count=0])
  554. Identical to the :func:`sub` function, using the compiled pattern.
  555. .. method:: RegexObject.subn(repl, string[, count=0])
  556. Identical to the :func:`subn` function, using the compiled pattern.
  557. .. attribute:: RegexObject.flags
  558. The flags argument used when the RE object was compiled, or ``0`` if no flags
  559. were provided.
  560. .. attribute:: RegexObject.groups
  561. The number of capturing groups in the pattern.
  562. .. attribute:: RegexObject.groupindex
  563. A dictionary mapping any symbolic group names defined by ``(?P<id>...)`` to group
  564. numbers. The dictionary is empty if no symbolic groups were used in the
  565. pattern.
  566. .. attribute:: RegexObject.pattern
  567. The pattern string from which the RE object was compiled.
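A quick sketch of these attributes on a made-up date pattern. The numeric value of ``flags`` can include implementation-defined bits, so the example only tests for the flag that was passed in:

```python
import re

rx = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})", re.IGNORECASE)

print(rx.groups)                       # 3 capturing groups
print(dict(rx.groupindex))             # names mapped to numbers: year -> 1, month -> 2
print(rx.pattern)                      # the original pattern string
print(bool(rx.flags & re.IGNORECASE))  # True
```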
  568. .. _match-objects:
  569. Match Objects
  570. -------------
  571. Match objects always have a boolean value of :const:`True`, so that you can test
  572. whether e.g. :func:`match` resulted in a match with a simple ``if`` statement. They
  573. support the following methods and attributes:
  574. .. method:: MatchObject.expand(template)
  575. Return the string obtained by doing backslash substitution on the template
  576. string *template*, as done by the :meth:`sub` method. Escapes such as ``\n`` are
  577. converted to the appropriate characters, and numeric backreferences (``\1``,
  578. ``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
  579. contents of the corresponding group.
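For example, a small sketch with illustrative group names (``first`` and ``last``):

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Named and numeric backreferences can be mixed in one template.
reordered = m.expand(r"\g<last>, \1")

# Escapes such as \n in the template become the real character.
two_lines = m.expand(r"\1\n\2")

print(reordered)        # Newton, Isaac
print(repr(two_lines))  # 'Isaac\nNewton'
```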
  580. .. method:: MatchObject.group([group1, ...])
  581. Return one or more subgroups of the match. If there is a single argument, the
  582. result is a single string; if there are multiple arguments, the result is a
  583. tuple with one item per argument. Without arguments, *group1* defaults to zero
  584. (the whole match is returned). If a *groupN* argument is zero, the corresponding
  585. return value is the entire matching string; if it is in the inclusive range
  586. [1..99], it is the string matching the corresponding parenthesized group. If a
  587. group number is negative or larger than the number of groups defined in the
  588. pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
  589. part of the pattern that did not match, the corresponding result is ``None``.
  590. If a group is contained in a part of the pattern that matched multiple times,
  591. the last match is returned.
  592. >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
  593. >>> m.group(0) # The entire match
  594. 'Isaac Newton'
  595. >>> m.group(1) # The first parenthesized subgroup.
  596. 'Isaac'
  597. >>> m.group(2) # The second parenthesized subgroup.
  598. 'Newton'
  599. >>> m.group(1, 2) # Multiple arguments give us a tuple.
  600. ('Isaac', 'Newton')
  601. If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
  602. arguments may also be strings identifying groups by their group name. If a
  603. string argument is not used as a group name in the pattern, an :exc:`IndexError`
  604. exception is raised.
  605. A moderately complicated example:
  606. >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
  607. >>> m.group('first_name')
  608. 'Malcolm'
  609. >>> m.group('last_name')
  610. 'Reynolds'
  611. Named groups can also be referred to by their index:
  612. >>> m.group(1)
  613. 'Malcolm'
  614. >>> m.group(2)
  615. 'Reynolds'
  616. If a group matches multiple times, only the last match is accessible:
  617. >>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
  618. >>> m.group(1) # Returns only the last match.
  619. 'c3'
  620. .. method:: MatchObject.groups([default])
  621. Return a tuple containing all the subgroups of the match, from 1 up to however
  622. many groups are in the pattern. The *default* argument is used for groups that
  623. did not participate in the match; it defaults to ``None``. (Incompatibility
  624. note: in the original Python 1.5 release, if the tuple was one element long, a
  625. string would be returned instead. In later versions (from 1.5.1 on), a
  626. singleton tuple is returned in such cases.)
  627. For example:
  628. >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
  629. >>> m.groups()
  630. ('24', '1632')
  631. If we make the decimal place and everything after it optional, not all groups
  632. might participate in the match. These groups will default to ``None`` unless
  633. the *default* argument is given:
  634. >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
  635. >>> m.groups() # Second group defaults to None.
  636. ('24', None)
  637. >>> m.groups('0') # Now, the second group defaults to '0'.
  638. ('24', '0')
  639. .. method:: MatchObject.groupdict([default])
  640. Return a dictionary containing all the *named* subgroups of the match, keyed by
  641. the subgroup name. The *default* argument is used for groups that did not
  642. participate in the match; it defaults to ``None``. For example:
  643. >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
  644. >>> m.groupdict()
  645. {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
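The *default* argument only matters when a named group did not participate in the match. A sketch with made-up group names (``whole`` and ``fraction``):

```python
import re

m = re.match(r"(?P<whole>\d+)(?:\.(?P<fraction>\d+))?", "24")

without_default = m.groupdict()   # the missing group maps to None
with_default = m.groupdict("0")   # the missing group maps to '0'

print(without_default["fraction"])  # None
print(with_default["fraction"])     # 0
```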
  646. .. method:: MatchObject.start([group])
  647. MatchObject.end([group])
  648. Return the indices of the start and end of the substring matched by *group*;
  649. *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
  650. *group* exists but did not contribute to the match. For a match object *m*, and
  651. a group *g* that did contribute to the match, the substring matched by group *g*
  652. (equivalent to ``m.group(g)``) is ::
  653. m.string[m.start(g):m.end(g)]
  654. Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
  655. null string. For example, after ``m = re.search('b(c?)', 'cba')``,
  656. ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
  657. 2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
  658. An example that will remove *remove_this* from email addresses:
  659. >>> email = "tony@tiremove_thisger.net"
  660. >>> m = re.search("remove_this", email)
  661. >>> email[:m.start()] + email[m.end():]
  662. 'tony@tiger.net'
  663. .. method:: MatchObject.span([group])
  664. For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
  665. m.end(group))``. Note that if *group* did not contribute to the match, this is
  666. ``(-1, -1)``. *group* defaults to zero, the entire match.
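A brief sketch, using a made-up pattern with an optional second group:

```python
import re

m = re.search(r"(\d+)(?:-(\d+))?", "pages 12")

print(m.span())   # (6, 8): the whole match
print(m.span(1))  # (6, 8): the first group
print(m.span(2))  # (-1, -1): the second group did not contribute
```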
  667. .. attribute:: MatchObject.pos
  668. The value of *pos* which was passed to the :func:`search` or :func:`match`
  669. method of the :class:`RegexObject`. This is the index into the string at which
  670. the RE engine started looking for a match.
  671. .. attribute:: MatchObject.endpos
  672. The value of *endpos* which was passed to the :func:`search` or :func:`match`
  673. method of the :class:`RegexObject`. This is the index into the string beyond
  674. which the RE engine will not go.
  675. .. attribute:: MatchObject.lastindex
  676. The integer index of the last matched capturing group, or ``None`` if no group
  677. was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
  678. ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
  679. the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
  680. string.
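The cases listed above can be checked directly:

```python
import re

print(re.match(r"(a)b", "ab").lastindex)      # 1
print(re.match(r"((a)(b))", "ab").lastindex)  # 1: the outer group closes last
print(re.match(r"((ab))", "ab").lastindex)    # 1
print(re.match(r"(a)(b)", "ab").lastindex)    # 2
print(re.match(r"ab", "ab").lastindex)        # None: no groups at all
```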
  681. .. attribute:: MatchObject.lastgroup
  682. The name of the last matched capturing group, or ``None`` if the group didn't
  683. have a name, or if no group was matched at all.
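One place this attribute can be handy is telling which alternative of a pattern matched. The pattern and group names below (``number``, ``word``) are illustrative assumptions:

```python
import re

token = re.compile(r"(?P<number>\d+)|(?P<word>[a-z]+)")

print(token.match("42").lastgroup)      # number
print(token.match("abc").lastgroup)     # word

# An unnamed group, or no group at all, gives None.
print(re.match(r"(\d+)", "42").lastgroup)  # None
print(re.match(r"\d+", "42").lastgroup)    # None
```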
  684. .. attribute:: MatchObject.re
  685. The regular expression object whose :meth:`match` or :meth:`search` method
  686. produced this :class:`MatchObject` instance.
  687. .. attribute:: MatchObject.string
  688. The string passed to :func:`match` or :func:`search`.
  689. Examples
  690. --------
  691. Checking For a Pair
  692. ^^^^^^^^^^^^^^^^^^^
  693. In this example, we'll use the following helper function to display match
  694. objects a little more gracefully:
  695. .. testcode::
  696. def displaymatch(match):
  697.     if match is None:
  698.         return None
  699.     return '<Match: %r, groups=%r>' % (match.group(), match.groups())
  700. Suppose you are writing a poker program where a player's hand is represented as
  701. a 5-character string with each character representing a card, "a" for ace, "k"
  702. for king, "q" for queen, "j" for jack, "0" for 10, and "1" through "9"
  703. representing the card with that value.
  704. To see if a given string is a valid hand, one could do the following:
  705. >>> valid = re.compile(r"[0-9akqj]{5}$")
  706. >>> displaymatch(valid.match("ak05q")) # Valid.
  707. "<Match: 'ak05q', groups=()>"
  708. >>> displaymatch(valid.match("ak05e")) # Invalid.
  709. >>> displaymatch(valid.match("ak0")) # Invalid.
  710. >>> displaymatch(valid.match("727ak")) # Valid.
  711. "<Match: '727ak', groups=()>"
  712. That last hand, ``"727ak"``, contained a pair, or two cards of the same value.
  713. To match this with a regular expression, one could use backreferences as follows:
  714. >>> pair = re.compile(r".*(.).*\1")
  715. >>> displaymatch(pair.match("717ak")) # Pair of 7s.
  716. "<Match: '717', groups=('7',)>"
  717. >>> displaymatch(pair.match("718ak")) # No pairs.
  718. >>> displaymatch(pair.match("354aa")) # Pair of aces.
  719. "<Match: '354aa', groups=('a',)>"
  720. To find out what card the pair consists of, one could use the :func:`group`
  721. method of :class:`MatchObject` in the following manner:
  722. .. doctest::
  723. >>> pair.match("717ak").group(1)
  724. '7'
  725. # Error because pair.match() returns None, which doesn't have a group() method:
  726. >>> pair.match("718ak").group(1)
  727. Traceback (most recent call last):
  728.   File "<pyshell#23>", line 1, in <module>
  729.     pair.match("718ak").group(1)
  730. AttributeError: 'NoneType' object has no attribute 'group'
  731. >>> pair.match("354aa").group(1)
  732. 'a'
  733. Simulating scanf()
  734. ^^^^^^^^^^^^^^^^^^
  735. .. index:: single: scanf()
  736. Python does not currently have an equivalent to :cfunc:`scanf`. Regular
  737. expressions are generally more powerful, though also more verbose, than
  738. :cfunc:`scanf` format strings. The table below offers some more-or-less
  739. equivalent mappings between :cfunc:`scanf` format tokens and regular
  740. expressions.
  741. +--------------------------------+---------------------------------------------+
  742. | :cfunc:`scanf` Token           | Regular Expression                          |
  743. +================================+=============================================+
  744. | ``%c``                         | ``.``                                       |
  745. +--------------------------------+---------------------------------------------+
  746. | ``%5c``                        | ``.{5}``                                    |
  747. +--------------------------------+---------------------------------------------+
  748. | ``%d``                         | ``[-+]?\d+``                                |
  749. +--------------------------------+---------------------------------------------+
  750. | ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
  751. +--------------------------------+---------------------------------------------+
  752. | ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``    |
  753. +--------------------------------+---------------------------------------------+
  754. | ``%o``                         | ``0[0-7]*``                                 |
  755. +--------------------------------+---------------------------------------------+
  756. | ``%s``                         | ``\S+``                                     |
  757. +--------------------------------+---------------------------------------------+
  758. | ``%u``                         | ``\d+``                                     |
  759. +--------------------------------+---------------------------------------------+
  760. | ``%x``, ``%X``                 | ``0[xX][\dA-Fa-f]+``                        |
  761. +--------------------------------+---------------------------------------------+
  762. To extract the filename and numbers from a string like ::
  763. /usr/sbin/sendmail - 0 errors, 4 warnings
  764. you would use a :cfunc:`scanf` format like ::
  765. %s - %d errors, %d warnings
  766. The equivalent regular expression would be ::
  767. (\S+) - (\d+) errors, (\d+) warnings
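Applied to the sample line, that expression could be used roughly as follows. Unlike :cfunc:`scanf`, the groups come back as strings, so the numeric conversion is done by hand; the variable names are illustrative:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

filename = m.group(1)
error_count = int(m.group(2))    # scanf's %d would convert this itself
warning_count = int(m.group(3))

print(filename)       # /usr/sbin/sendmail
print(error_count)    # 0
print(warning_count)  # 4
```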
  768. Avoiding recursion
  769. ^^^^^^^^^^^^^^^^^^
  770. If you create regular expressions that require the engine to perform a lot of
  771. recursion, you may encounter a :exc:`RuntimeError` exception with the message
  772. ``maximum recursion limit exceeded``. For example, ::
  773. >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
  774. >>> re.match(r'Begin (\w| )*? end', s).end()
  775. Traceback (most recent call last):
  776.   File "<stdin>", line 1, in ?
  777.   File "/usr/local/lib/python2.5/re.py", line 132, in match
  778.     return _compile(pattern, flags).match(string)
  779. RuntimeError: maximum recursion limit exceeded
  780. You can often restructure your regular expression to avoid recursion.
  781. Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
  782. avoid recursion. Thus, the above regular expression can avoid recursion by
  783. being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
  784. regular expressions will run faster than their recursive equivalents.
  785. search() vs. match()
  786. ^^^^^^^^^^^^^^^^^^^^
  787. In a nutshell, :func:`match` only attempts to match a pattern at the beginning
  788. of a string, whereas :func:`search` will match a pattern anywhere in a string.
  789. For example:
  790. >>> re.match("o", "dog") # No match as "o" is not the first letter of "dog".
  791. >>> re.search("o", "dog") # Match as search() looks everywhere in the string.
  792. <_sre.SRE_Match object at ...>
  793. .. note::
  794. The following applies only to regular expression objects like those created
  795. with ``re.compile("pattern")``, not the primitives ``re.match(pattern,
  796. string)`` or ``re.search(pattern, string)``.
  797. :func:`match` has an optional second parameter that gives an index in the string
  798. where the search is to start::
  799. >>> pattern = re.compile("o")
  800. >>> pattern.match("dog") # No match as "o" is not at the start of "dog".
  801. # Equivalent to the above expression as 0 is the default starting index:
  802. >>> pattern.match("dog", 0)
  803. # Match as "o" is the 2nd character of "dog" (index 0 is the first):
  804. >>> pattern.match("dog", 1)
  805. <_sre.SRE_Match object at ...>
  806. >>> pattern.match("dog", 2) # No match as "o" is not the 3rd character of "dog".
  807. Making a Phonebook
  808. ^^^^^^^^^^^^^^^^^^
  809. :func:`split` splits a string into a list delimited by the passed pattern. The
  810. method is invaluable for converting textual data into data structures that can be
  811. easily read and modified by Python as demonstrated in the following example that
  812. creates a phonebook.
  813. First, here is the input. Normally it may come from a file; here we are using
  814. triple-quoted string syntax:
  815. >>> input = """Ross McFluff: 834.345.1254 155 Elm Street
  816. ...
  817. ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
  818. ... Frank Burger: 925.541.7625 662 South Dogwood Way
  819. ...
  820. ...
  821. ... Heather Albrecht: 548.326.4584 919 Park Place"""
  822. The entries are separated by one or more newlines. Now we convert the string
  823. into a list with each nonempty line having its own entry:
  824. .. doctest::
  825. :options: +NORMALIZE_WHITESPACE
  826. >>> entries = re.split("\n+", input)
  827. >>> entries
  828. ['Ross McFluff: 834.345.1254 155 Elm Street',
  829. 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
  830. 'Frank Burger: 925.541.7625 662 South Dogwood Way',
  831. 'Heather Albrecht: 548.326.4584 919 Park Place']
  832. Finally, split each entry into a list with first name, last name, telephone
  833. number, and address. We use the ``maxsplit`` parameter of :func:`split`
  834. because the address itself contains spaces, which our splitting pattern matches:
  835. .. doctest::
  836. :options: +NORMALIZE_WHITESPACE
  837. >>> [re.split(":? ", entry, 3) for entry in entries]
  838. [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
  839. ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
  840. ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
  841. ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
  842. The ``:?`` pattern matches the colon after the last name, so that it does not
  843. occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
  844. house number from the street name:
  845. .. doctest::
  846. :options: +NORMALIZE_WHITESPACE
  847. >>> [re.split(":? ", entry, 4) for entry in entries]
  848. [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
  849. ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
  850. ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
  851. ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
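As a possible next step that is not part of the example above, the split fields could be collected into a dictionary keyed by full name. The two entries are copied from the input; the ``phonebook`` structure is an assumption made for illustration:

```python
import re

entries = [
    "Ross McFluff: 834.345.1254 155 Elm Street",
    "Heather Albrecht: 548.326.4584 919 Park Place",
]

phonebook = {}
for entry in entries:
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(":? ", entry, maxsplit=3)
    phonebook[first + " " + last] = (phone, address)

print(phonebook["Ross McFluff"])  # ('834.345.1254', '155 Elm Street')
```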
  852. Text Munging
  853. ^^^^^^^^^^^^
  854. :func:`sub` replaces every occurrence of a pattern with a string or the
  855. result of a function. This example demonstrates using :func:`sub` with
  856. a function to "munge" text, or randomize the order of all the characters
  857. in each word of a sentence except for the first and last characters::
  858. >>> def repl(m):
  859. ...     inner_word = list(m.group(2))
  860. ...     random.shuffle(inner_word)
  861. ...     return m.group(1) + "".join(inner_word) + m.group(3)
  862. >>> text = "Professor Abdolmalek, please report your absences promptly."
  863. >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
  864. 'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
  865. >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
  866. 'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
  867. Finding all Adverbs
  868. ^^^^^^^^^^^^^^^^^^^
  869. :func:`findall` matches *all* occurrences of a pattern, not just the first
  870. one as :func:`search` does. For example, a writer who wanted to find all of
  871. the adverbs in some text might use :func:`findall` in the following
  872. manner:
  873. >>> text = "He was carefully disguised but captured quickly by police."
  874. >>> re.findall(r"\w+ly", text)
  875. ['carefully', 'quickly']
  876. Finding all Adverbs and their Positions
  877. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  878. If one wants more information about all matches of a pattern than the matched
  879. text, :func:`finditer` is useful as it provides instances of
  880. :class:`MatchObject` instead of strings. Continuing with the previous example,
  881. a writer who wanted to find all of the adverbs *and their positions* in some
  882. text would use :func:`finditer` in the following manner:
  883. >>> text = "He was carefully disguised but captured quickly by police."
  884. >>> for m in re.finditer(r"\w+ly", text):
  885. ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
  886. 07-16: carefully
  887. 40-47: quickly
  888. Raw String Notation
  889. ^^^^^^^^^^^^^^^^^^^
  890. Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
  891. every backslash (``'\'``) in a regular expression would have to be prefixed with
  892. another one to escape it. For example, the two following lines of code are
  893. functionally identical:
  894. >>> re.match(r"\W(.)\1\W", " ff ")
  895. <_sre.SRE_Match object at ...>
  896. >>> re.match("\\W(.)\\1\\W", " ff ")
  897. <_sre.SRE_Match object at ...>
  898. When one wants to match a literal backslash, it must be escaped in the regular
  899. expression. With raw string notation, this means ``r"\\"``. Without raw string
  900. notation, one must use ``"\\\\"``, making the following lines of code
  901. functionally identical:
  902. >>> re.match(r"\\", r"\\")
  903. <_sre.SRE_Match object at ...>
  904. >>> re.match("\\\\", r"\\")
  905. <_sre.SRE_Match object at ...>