:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning.  This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline.  Usually patterns will be expressed in Python code using this raw
string notation.
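
As a minimal sketch of the difference (the variable names and the sample path
are illustrative, not part of the module), the two notations can be compared
directly::

   import re

   # A raw string keeps the backslash: two characters, '\' and 'n'.
   raw_len = len(r"\n")        # 2
   # A normal literal turns the escape into one newline character.
   cooked_len = len("\n")      # 1

   # Both patterns below are the two-character regular expression \\,
   # which matches a single literal backslash.
   path = "C:\\temp"           # the seven-character string C:\temp
   with_raw = re.search(r"\\", path)
   with_cooked = re.search("\\\\", path)
   same_span = with_raw.span() == with_cooked.span()   # True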

It is important to note that most regular expression operations are available as
module-level functions and :class:`RegexObject` methods.  The functions are
shortcuts that don't require you to compile a regex object first, but lack some
fine-tuning parameters.

.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly.  The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contains low-precedence
operations, there are boundary conditions between *A* and *B*, or either has
numbered group references.  Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here.  For details of
the theory and implementation of regular expressions, consult the Friedl book
referenced above, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows.  For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves.  You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.


The special characters are:

``'.'``
   (Dot.)  In the default mode, this matches any character except a newline.  If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.
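
   The cases above can be checked with a short sketch (the variable names are
   illustrative)::

      import re

      text = 'foo1\nfoo2\n'
      # Default mode: '$' matches only at the very end of the string, or
      # just before a final newline.
      default_hit = re.search(r'foo.$', text).group(0)            # 'foo2'
      # MULTILINE mode: '$' also matches before every newline.
      multiline_hit = re.search(r'foo.$', text, re.M).group(0)    # 'foo1'
      # A bare '$' finds two empty matches in 'foo\n'.
      positions = [m.start() for m in re.finditer(r'$', 'foo\n')]  # [3, 4]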

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible.  Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
   string, and not just ``'<H1>'``.  Adding ``'?'`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched.  Using ``.*?`` in the previous
   expression will match only ``'<H1>'``.

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match.  For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible.  For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
   lower bound of zero,  and omitting *n* specifies an infinite upper bound.  As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible.  This is the
   non-greedy version of the previous qualifier.  For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.
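
   A minimal sketch contrasting the greedy and non-greedy qualifiers described
   above (the variable names are illustrative)::

      import re

      html = '<H1>title</H1>'
      greedy = re.match(r'<.*>', html).group(0)      # '<H1>title</H1>'
      minimal = re.match(r'<.*?>', html).group(0)    # '<H1>'

      # The same applies to counted repetitions.
      most = re.match(r'a{3,5}', 'aaaaaa').group(0)    # 'aaaaa'
      fewest = re.match(r'a{3,5}?', 'aaaaaa').group(0) # 'aaa'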

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string.  However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice.  This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters.  Characters can be listed individually, or
   a range of characters can be indicated by giving two characters and separating
   them by a ``'-'``.  Special characters are not active inside sets.  For example,
   ``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
   ``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
   ``[a-zA-Z0-9]`` matches any letter or digit.  Character classes such
   as ``\w`` or ``\S`` (defined below) are also acceptable inside a
   set, although the characters they match depend on whether :const:`LOCALE`
   or :const:`UNICODE` mode is in force.  If you want to include a
   ``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
   place it as the first character.  The pattern ``[]]`` will match
   ``']'``, for example.

   You can match the characters not within a range by :dfn:`complementing` the set.
   This is indicated by including a ``'^'`` as the first character of the set;
   ``'^'`` elsewhere will simply match the ``'^'`` character.  For example,
   ``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
   character except ``'^'``.

   Note that inside ``[]`` the special forms and special characters lose
   their meanings and only the syntaxes described here are valid. For
   example, ``+``, ``*``, ``(``, ``)``, and so on are treated as
   literals inside ``[]``, and backreferences cannot be used inside
   ``[]``.

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B.  An arbitrary number of REs can be separated by the
   ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match.  In other words, the ``'|'`` operator is never
   greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.
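
   The left-to-right rule can be illustrated with a small sketch (the patterns
   and variable names are illustrative)::

      import re

      # 'a' is tried first and succeeds, so the longer 'ab' is never tried.
      first = re.match(r'a|ab', 'ab').group(0)     # 'a'
      # Putting the longer alternative first changes the result.
      longer = re.match(r'ab|a', 'ab').group(0)    # 'ab'
      # A literal '|' can be escaped or placed in a character class.
      escaped = re.findall(r'\|', 'a|b')           # ['|']
      in_class = re.findall(r'[|]', 'a|b')         # ['|']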

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below.  To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise).  The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.)  The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flag* argument to the
   :func:`compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible within the rest of the regular expression via the symbolic group
   name *name*.  Group names must be valid Python identifiers, and each group
   name must be defined only once within a regular expression.  A symbolic group
   is also a numbered group, just as if the group were not named.  So the group
   named ``id`` in the example below can also be referenced as the numbered group
   ``1``.

   For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
   referenced by its name in arguments to methods of match objects, such as
   ``m.group('id')`` or ``m.end('id')``, and also by name in the regular
   expression itself (using ``(?P=id)``) and replacement text given to
   ``.sub()`` (using ``\g<id>``).
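
   A minimal sketch of these three ways of referring to the group (the input
   strings and variable names are illustrative)::

      import re

      ident = re.compile(r'(?P<id>[a-zA-Z_]\w*)')
      m = ident.search('  spam42 = 1')
      by_name = m.group('id')      # 'spam42'
      by_number = m.group(1)       # 'spam42', the same group by number
      end_pos = m.end('id')        # 8

      # (?P=id) matches the same text again inside the pattern.
      doubled = re.search(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'the the end')
      repeated = doubled.group(0)  # 'the the'

      # \g<id> refers to the group in replacement text.
      wrapped = ident.sub(r'<\g<id>>', 'spam eggs')   # '<spam> <eggs>'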

``(?P=name)``
   Matches whatever text was matched by the earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string.  This is
   called a lookahead assertion.  For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next.  This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.
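
   Both lookahead forms can be tried side by side in a short sketch (the
   variable names are illustrative)::

      import re

      pos = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
      consumed = pos.group(0)     # 'Isaac ' -- 'Asimov' is not consumed
      pos_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')   # None

      neg = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
      neg_hit = neg.group(0)      # 'Isaac '
      neg_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')   # None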

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position.  This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
   patterns which start with positive lookbehind assertions will never match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length.  Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example,  ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.
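
   A sketch of this pattern in use, assuming it is applied with :func:`match`
   so that matching starts at the beginning of the string (the variable names
   are illustrative)::

      import re

      email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

      bracketed = email.match('<user@host.com>')
      bare = email.match('user@host.com')
      # Group 1 matched '<', so a closing '>' is required and is missing.
      unbalanced = email.match('<user@host.com')      # None

      address = bracketed.group(2)                    # 'user@host.com'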

   .. versionadded:: 2.4

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character.  For example, ``\$`` matches the character ``'$'``.

``\number``
   Matches the contents of the group of the same number.  Groups are numbered
   starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'the end'`` (note the space after the group).  This special sequence
   can only be used to match one of the first 99 groups.  If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word.  A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
   precise set of characters deemed to be alphanumeric depends on the values of the
   ``UNICODE`` and ``LOCALE`` flags.  Inside a character range, ``\b`` represents
   the backspace character, for compatibility with Python's string literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word.  This is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.
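
   A short sketch of ``\b`` and ``\B`` on plain ASCII text (the sample text and
   variable names are illustrative)::

      import re

      text = 'class subclass classy'
      # \b on both sides: only the stand-alone word matches.
      whole_words = re.findall(r'\bclass\b', text)                 # ['class']
      # \B before 'class': only the occurrence inside 'subclass' matches.
      inner = [m.start() for m in re.finditer(r'\Bclass', text)]   # [9]
      # Inside a character class, \b is the backspace character.
      backspace_at = re.search(r'[\b]', 'a\bc').start()            # 1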

``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``.  With :const:`UNICODE`, it will match
   whatever is classified as a digit in the Unicode character properties database.

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``.  With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.

``\s``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
   :const:`LOCALE`, it will match this set plus whatever characters are defined as
   space for the current locale. If :const:`UNICODE` is set, this will match the
   characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
   character properties database.

``\S``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-whitespace character; this is equivalent to the set
   ``[^ \t\n\r\f\v]``.  With :const:`LOCALE`, it will match any character not in
   this set, and not defined as space in the current locale. If :const:`UNICODE`
   is set, this will match anything other than ``[ \t\n\r\f\v]`` and characters
   marked as space in the Unicode character properties database.

``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``.  With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale.  If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match anything other than ``[0-9_]`` and characters marked as
   alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference.  As for string literals, octal escapes are always at most
three digits in length.


.. _matching-searching:

Matching vs Searching
---------------------

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>


Python offers two different primitive operations based on regular expressions:
**match** checks for a match only at the beginning of the string, while
**search** checks for a match anywhere in the string (this is what Perl does
by default).

Note that match may differ from search even when using a regular expression
beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
:const:`MULTILINE` mode also immediately following a newline.  The "match"
operation succeeds only if the pattern matches at the start of the string
regardless of mode, or at the starting position given by the optional *pos*
argument regardless of whether a newline precedes it.

   >>> re.match("c", "abcdef")  # No match
   >>> re.search("c", "abcdef") # Match
   <_sre.SRE_Match object at ...>
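
The :const:`MULTILINE` and *pos* cases described above can be sketched as
follows (the sample text and variable names are illustrative)::

   import re

   text = 'A\nB\nX'
   # match() only tries position 0, even in MULTILINE mode.
   at_start = re.match(r'X', text, re.MULTILINE)       # None
   # search() with '^' can succeed at the start of any line.
   any_line = re.search(r'^X', text, re.MULTILINE)
   line_start = any_line.start()                       # 4
   # pos moves the starting point for a compiled pattern's match().
   from_pos = re.compile(r'X').match(text, 4)
   pos_start = from_pos.start()                        # 4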


.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled form.


.. function:: compile(pattern[, flags])

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`match` and :func:`search` methods,
   described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`compile` and saving the resulting regular expression object
   for reuse is more efficient when the expression will be used several times
   in a single program.
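
   A small sketch of combining flags with ``|`` and reusing the compiled
   object (the pattern, the sample log text and the variable names are
   illustrative)::

      import re

      # IGNORECASE and MULTILINE combined with bitwise OR.
      prog = re.compile(r'^error: (.+)$', re.IGNORECASE | re.MULTILINE)
      log = 'ok\nERROR: disk full\nerror: timeout'

      messages = prog.findall(log)          # ['disk full', 'timeout']
      # The same compiled object is reused for a second call.
      first = prog.search(log).group(1)     # 'disk full'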

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too.  This is not affected by the current locale.


.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline).  By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.


.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


.. data:: U
          UNICODE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
   on the Unicode character properties database.

   .. versionadded:: 2.0


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer. Whitespace
   within the pattern is ignored, except when in a character class or preceded by
   an unescaped backslash, and, when a line contains a ``'#'`` neither in a
   character class nor preceded by an unescaped backslash, all characters from the
   leftmost such ``'#'`` through the end of the line are ignored.

   That means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")


.. function:: search(pattern, string[, flags])

   Scan through *string* looking for a location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.


.. function:: match(pattern, string[, flags])

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   .. note::

      If you want to locate a match anywhere in *string*, use :meth:`search`
      instead.


.. function:: split(pattern, string[, maxsplit=0])

   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list.  (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored.  This has been fixed in later releases.)

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string.  The same holds for
   the end of the string:

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list (e.g., if there's one capturing group
   in the separator, at the 1st, the 3rd and so forth).

   Note that *split* will never split a string on an empty pattern match.
   For example:

      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']


.. function:: findall(pattern, string[, flags])

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings.  The *string* is scanned left-to-right, and matches are returned in
   the order found.  If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group.  Empty matches are included in the result unless they touch the
   beginning of another match.
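
   How the result shape depends on the number of groups can be seen in a
   short sketch (the sample text and variable names are illustrative)::

      import re

      text = 'width=80 height=24'
      # No groups: a list of whole matches.
      no_groups = re.findall(r'\w+=\d+', text)
      # no_groups == ['width=80', 'height=24']
      # One group: a list of that group's text.
      one_group = re.findall(r'(\w+)=\d+', text)
      # one_group == ['width', 'height']
      # Two groups: a list of tuples.
      two_groups = re.findall(r'(\w+)=(\d+)', text)
      # two_groups == [('width', '80'), ('height', '24')]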

   .. versionadded:: 1.5.2

   .. versionchanged:: 2.4
      Added the optional flags argument.


.. function:: finditer(pattern, string[, flags])

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*.  The *string* is
   scanned left-to-right, and matches are returned in the order found.  Empty
   matches are included in the result unless they touch the beginning of another
   match.
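
   Because each item is a match object, positions are available as well as
   the matched text; a minimal sketch (the sample text and variable name are
   illustrative)::

      import re

      # Collect the text and the span of every run of digits.
      spans = [(m.group(0), m.start(), m.end())
               for m in re.finditer(r'\d+', 'a1 bb22 ccc333')]
      # spans == [('1', 1, 2), ('22', 5, 7), ('333', 11, 14)]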
 597
 598   .. versionadded:: 2.2
 599
 600   .. versionchanged:: 2.4
 601      Added the optional flags argument.
 602
 603
 604.. function:: sub(pattern, repl, string[, count])
 605
 606   Return the string obtained by replacing the leftmost non-overlapping occurrences
 607   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
 608   *string* is returned unchanged.  *repl* can be a string or a function; if it is
 609   a string, any backslash escapes in it are processed.  That is, ``\n`` is
 610   converted to a single newline character, ``\r`` is converted to a linefeed, and
 611   so forth.  Unknown escapes such as ``\j`` are left alone.  Backreferences, such
 612   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
 613   For example:
 614
 615      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
 616      ...        r'static PyObject*\npy_\1(void)\n{',
 617      ...        'def myfunc():')
 618      'static PyObject*\npy_myfunc(void)\n{'
 619
 620   If *repl* is a function, it is called for every non-overlapping occurrence of
 621   *pattern*.  The function takes a single match object argument, and returns the
 622   replacement string.  For example:
 623
 624      >>> def dashrepl(matchobj):
 625      ...     if matchobj.group(0) == '-': return ' '
 626      ...     else: return '-'
 627      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
 628      'pro--gram files'
 629
 630   The pattern may be a string or an RE object; if you need to specify regular
 631   expression flags, you must use a RE object, or use embedded modifiers in a
 632   pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
 633
 634   The optional argument *count* is the maximum number of pattern occurrences to be
 635   replaced; *count* must be a non-negative integer.  If omitted or zero, all
 636   occurrences will be replaced. Empty matches for the pattern are replaced only
 637   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
 638   ``'-a-b-c-'``.
 639
 640   In addition to character escapes and backreferences as described above,
 641   ``\g<name>`` will use the substring matched by the group named ``name``, as
 642   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
 643   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
 644   in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
 645   reference to group 20, not a reference to group 2 followed by the literal
 646   character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
 647   substring matched by the RE.
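
   A minimal sketch of these replacement forms, using made-up sample strings
   (the group names ``first`` and ``last`` are illustrative only)::

      import re

      # Named references: turn "first last" into "last, first".
      swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)",
                       r"\g<last>, \g<first>", "Isaac Newton")

      # \g<1>0 is group 1 followed by a literal '0'; \10 would mean group 10.
      padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")

      # \g<0> stands for the entire match.
      bracketed = re.sub(r"\d+", r"[\g<0>]", "a1b22")

   This should give ``'Newton, Isaac'``, ``'a10b20'`` and ``'a[1]b[22]'``
   respectively.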
 648
 649
 650.. function:: subn(pattern, repl, string[, count])
 651
 652   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
 653   number_of_subs_made)``.
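
   For instance, the count can be used to tell whether anything changed.  A
   small sketch with a made-up input string::

      import re

      # Collapse each run of whitespace into a single space and also
      # learn how many runs were replaced.
      new_text, n_subs = re.subn(r"\s+", " ", "a  b \t c")

   Here ``new_text`` is ``'a b c'`` and ``n_subs`` is ``2``.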
 654
 655
 656.. function:: escape(string)
 657
 658   Return *string* with all non-alphanumerics backslashed; this is useful if you
 659   want to match an arbitrary literal string that may have regular expression
 660   metacharacters in it.
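
   A brief sketch of why escaping matters (the strings are invented for this
   example)::

      import re

      literal = "1+1=2"
      haystack = "so 1+1=2 holds"

      # Unescaped, '+' is a repetition operator, so the pattern means
      # "one or more '1', then '1=2'" and does not match here.
      unescaped = re.search(literal, haystack)

      # Escaped, every character is matched literally.
      escaped = re.search(re.escape(literal), haystack)

   ``unescaped`` is ``None``, while ``escaped`` matches the text ``'1+1=2'``.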
 661
 662
 663.. exception:: error
 664
 665   Exception raised when a string passed to one of the functions here is not a
 666   valid regular expression (for example, it might contain unmatched parentheses)
 667   or when some other error occurs during compilation or matching.  It is never an
 668   error if a string contains no match for a pattern.
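
   One way this exception might be used is to validate a pattern supplied by a
   user.  The helper name ``is_valid_pattern`` below is hypothetical, not part
   of this module::

      import re

      def is_valid_pattern(pattern):
          """Return True if pattern compiles, False if re.error is raised."""
          try:
              re.compile(pattern)
          except re.error:
              return False
          return True

      ok = is_valid_pattern(r"(a|b)+")      # balanced parentheses
      broken = is_valid_pattern(r"(a|b")    # unmatched parenthesis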
 669
 670
 671.. _re-objects:
 672
 673Regular Expression Objects
 674--------------------------
 675
 676Compiled regular expression objects support the following methods and
 677attributes:
 678
 679
 680.. method:: RegexObject.match(string[, pos[, endpos]])
 681
 682   If zero or more characters at the beginning of *string* match this regular
 683   expression, return a corresponding :class:`MatchObject` instance.  Return
 684   ``None`` if the string does not match the pattern; note that this is different
 685   from a zero-length match.
 686
 687   .. note::
 688
 689      If you want to locate a match anywhere in *string*, use :meth:`search`
 690      instead.
 691
 692   The optional second parameter *pos* gives an index in the string where the
 693   search is to start; it defaults to ``0``.  This is not completely equivalent to
 694   slicing the string; the ``'^'`` pattern character matches at the real beginning
 695   of the string and at positions just after a newline, but not necessarily at the
 696   index where the search is to start.
 697
 698   The optional parameter *endpos* limits how far the string will be searched; it
 699   will be as if the string is *endpos* characters long, so only the characters
 700   from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
 702   expression object, ``rx.match(string, 0, 50)`` is equivalent to
 703   ``rx.match(string[:50], 0)``.
 704
 705      >>> pattern = re.compile("o")
 706      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog."
 707      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
 708      <_sre.SRE_Match object at ...>
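
   The difference between *pos* and slicing, and the effect of *endpos*, can be
   sketched as follows (the patterns and strings are made up for
   illustration)::

      import re

      anchored = re.compile(r"^b")

      # '^' still refers to the real start of the string, so starting the
      # match at index 1 does not satisfy it.
      with_pos = anchored.match("abc", 1)

      # Slicing builds a new string whose first character is 'b'.
      with_slice = anchored.match("abc"[1:])

      # With endpos=3 the string is treated as if it were 3 characters long.
      word = re.compile(r"\w+")
      clipped = word.match("hello world", 0, 3)

   ``with_pos`` is ``None``, ``with_slice`` matches ``'b'``, and ``clipped``
   matches only ``'hel'``.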
 709
 710
 711.. method:: RegexObject.search(string[, pos[, endpos]])
 712
 713   Scan through *string* looking for a location where this regular expression
 714   produces a match, and return a corresponding :class:`MatchObject` instance.
 715   Return ``None`` if no position in the string matches the pattern; note that this
 716   is different from finding a zero-length match at some point in the string.
 717
 718   The optional *pos* and *endpos* parameters have the same meaning as for the
 719   :meth:`match` method.
 720
 721
 722.. method:: RegexObject.split(string[, maxsplit=0])
 723
 724   Identical to the :func:`split` function, using the compiled pattern.
 725
 726
 727.. method:: RegexObject.findall(string[, pos[, endpos]])
 728
 729   Identical to the :func:`findall` function, using the compiled pattern.
 730
 731
 732.. method:: RegexObject.finditer(string[, pos[, endpos]])
 733
 734   Identical to the :func:`finditer` function, using the compiled pattern.
 735
 736
 737.. method:: RegexObject.sub(repl, string[, count=0])
 738
 739   Identical to the :func:`sub` function, using the compiled pattern.
 740
 741
 742.. method:: RegexObject.subn(repl, string[, count=0])
 743
 744   Identical to the :func:`subn` function, using the compiled pattern.
 745
 746
 747.. attribute:: RegexObject.flags
 748
 749   The flags argument used when the RE object was compiled, or ``0`` if no flags
 750   were provided.
 751
 752
 753.. attribute:: RegexObject.groups
 754
 755   The number of capturing groups in the pattern.
 756
 757
 758.. attribute:: RegexObject.groupindex
 759
 760   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
 761   numbers.  The dictionary is empty if no symbolic groups were used in the
 762   pattern.
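
   A small sketch relating :attr:`groups` and :attr:`groupindex`; the pattern
   and the group names ``year`` and ``month`` are made up for this example::

      import re

      date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

      n_groups = date.groups            # named and unnamed groups both count
      names = dict(date.groupindex)     # only the named groups appear here

      # A group name can be translated to its number and used interchangeably.
      m = date.match("2009-07-15")
      by_name = m.group("year")
      by_number = m.group(date.groupindex["year"])

   Here ``n_groups`` is ``3`` and ``names`` is ``{'year': 1, 'month': 2}``.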
 763
 764
 765.. attribute:: RegexObject.pattern
 766
 767   The pattern string from which the RE object was compiled.
 768
 769
 770.. _match-objects:
 771
 772Match Objects
 773-------------
 774
 775Match objects always have a boolean value of :const:`True`, so that you can test
 776whether e.g. :func:`match` resulted in a match with a simple if statement.  They
 777support the following methods and attributes:
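
For instance, a result can be tested directly in an ``if`` statement.  The
following is a minimal sketch; ``first_number`` is a hypothetical helper, not
part of this module::

   import re

   def first_number(text):
       """Return the first run of digits in text as an int, or None."""
       m = re.search(r"\d+", text)
       if m:                      # a match object is always true
           return int(m.group())
       return None                # search() returned None: no match

   # Even a zero-length match yields a (true) match object, unlike None.
   empty_match = re.match(r"x*", "abc")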
 778
 779
 780.. method:: MatchObject.expand(template)
 781
 782   Return the string obtained by doing backslash substitution on the template
 783   string *template*, as done by the :meth:`sub` method. Escapes such as ``\n`` are
 784   converted to the appropriate characters, and numeric backreferences (``\1``,
 785   ``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
 786   contents of the corresponding group.
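
   A minimal sketch of the template forms, with made-up group names
   ``first`` and ``last``::

      import re

      m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

      by_number = m.expand(r"\2, \1")              # numeric backreferences
      by_name = m.expand(r"\g<last>, \g<first>")   # named backreferences
      two_lines = m.expand(r"\1\n\2")              # \n becomes a real newline

   Both ``by_number`` and ``by_name`` are ``'Newton, Isaac'``, and
   ``two_lines`` contains the two names separated by a newline character.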
 787
 788
 789.. method:: MatchObject.group([group1, ...])
 790
 791   Returns one or more subgroups of the match.  If there is a single argument, the
 792   result is a single string; if there are multiple arguments, the result is a
 793   tuple with one item per argument. Without arguments, *group1* defaults to zero
 794   (the whole match is returned). If a *groupN* argument is zero, the corresponding
 795   return value is the entire matching string; if it is in the inclusive range
 796   [1..99], it is the string matching the corresponding parenthesized group.  If a
 797   group number is negative or larger than the number of groups defined in the
 798   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
 799   part of the pattern that did not match, the corresponding result is ``None``.
 800   If a group is contained in a part of the pattern that matched multiple times,
 801   the last match is returned.
 802
 803      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
 804      >>> m.group(0)       # The entire match
 805      'Isaac Newton'
 806      >>> m.group(1)       # The first parenthesized subgroup.
 807      'Isaac'
 808      >>> m.group(2)       # The second parenthesized subgroup.
 809      'Newton'
 810      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
 811      ('Isaac', 'Newton')
 812
 813   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
 814   arguments may also be strings identifying groups by their group name.  If a
 815   string argument is not used as a group name in the pattern, an :exc:`IndexError`
 816   exception is raised.
 817
 818   A moderately complicated example:
 819
      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index:

      >>> m.group(1)
      'Malcolm'
 830      >>> m.group(2)
 831      'Reynolds'
 832
 833   If a group matches multiple times, only the last match is accessible:
 834
 835      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
 836      >>> m.group(1)                        # Returns only the last match.
 837      'c3'
 838
 839
 840.. method:: MatchObject.groups([default])
 841
 842   Return a tuple containing all the subgroups of the match, from 1 up to however
 843   many groups are in the pattern.  The *default* argument is used for groups that
 844   did not participate in the match; it defaults to ``None``.  (Incompatibility
 845   note: in the original Python 1.5 release, if the tuple was one element long, a
 846   string would be returned instead.  In later versions (from 1.5.1 on), a
 847   singleton tuple is returned in such cases.)
 848
 849   For example:
 850
 851      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
 852      >>> m.groups()
 853      ('24', '1632')
 854
 855   If we make the decimal place and everything after it optional, not all groups
 856   might participate in the match.  These groups will default to ``None`` unless
 857   the *default* argument is given:
 858
 859      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
 860      >>> m.groups()      # Second group defaults to None.
 861      ('24', None)
 862      >>> m.groups('0')   # Now, the second group defaults to '0'.
 863      ('24', '0')
 864
 865
 866.. method:: MatchObject.groupdict([default])
 867
 868   Return a dictionary containing all the *named* subgroups of the match, keyed by
 869   the subgroup name.  The *default* argument is used for groups that did not
 870   participate in the match; it defaults to ``None``.  For example:
 871
      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
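
   The *default* argument can be sketched in the same way; the group names
   ``whole`` and ``frac`` are made up for this example::

      import re

      # The fractional part is optional, so 'frac' may not participate.
      m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

      without_default = m.groupdict()      # missing groups map to None
      with_default = m.groupdict('0')      # missing groups map to '0'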
 875
 876
 877.. method:: MatchObject.start([group])
 878            MatchObject.end([group])
 879
 880   Return the indices of the start and end of the substring matched by *group*;
 881   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
 882   *group* exists but did not contribute to the match.  For a match object *m*, and
 883   a group *g* that did contribute to the match, the substring matched by group *g*
 884   (equivalent to ``m.group(g)``) is ::
 885
 886      m.string[m.start(g):m.end(g)]
 887
 888   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
 889   null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
 890   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
 891   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
 892
 893   An example that will remove *remove_this* from email addresses:
 894
 895      >>> email = "tony@tiremove_thisger.net"
 896      >>> m = re.search("remove_this", email)
 897      >>> email[:m.start()] + email[m.end():]
 898      'tony@tiger.net'
 899
 900
 901.. method:: MatchObject.span([group])
 902
 903   For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
 904   m.end(group))``. Note that if *group* did not contribute to the match, this is
 905   ``(-1, -1)``.  *group* defaults to zero, the entire match.
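
   A short sketch, using an invented input string in which the second group
   does not take part in the match::

      import re

      m = re.search(r"(\d+)-(\d+)?", "tel 555-")

      whole = m.span()       # the entire match, '555-'
      first = m.span(1)      # the digits before the hyphen
      second = m.span(2)     # the optional group matched nothing

   Here ``whole`` is ``(4, 8)``, ``first`` is ``(4, 7)`` and ``second`` is
   ``(-1, -1)``.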
 906
 907
 908.. attribute:: MatchObject.pos
 909
 910   The value of *pos* which was passed to the :func:`search` or :func:`match`
 911   method of the :class:`RegexObject`.  This is the index into the string at which
 912   the RE engine started looking for a match.
 913
 914
 915.. attribute:: MatchObject.endpos
 916
 917   The value of *endpos* which was passed to the :func:`search` or :func:`match`
 918   method of the :class:`RegexObject`.  This is the index into the string beyond
 919   which the RE engine will not go.
 920
 921
 922.. attribute:: MatchObject.lastindex
 923
 924   The integer index of the last matched capturing group, or ``None`` if no group
 925   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
 926   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
 927   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
 928   string.
 929
 930
 931.. attribute:: MatchObject.lastgroup
 932
 933   The name of the last matched capturing group, or ``None`` if the group didn't
 934   have a name, or if no group was matched at all.
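
   A short sketch tying :attr:`lastindex` and :attr:`lastgroup` together; the
   group names ``number`` and ``word`` are made up for this example::

      import re

      # Nested groups: the outer group is the last one to close.
      nested = re.match(r"((a)(b))", "ab").lastindex
      # Sibling groups: the second one is the last to match.
      siblings = re.match(r"(a)(b)", "ab").lastindex

      # With alternatives, lastgroup names the branch that matched.
      token = re.compile(r"(?P<number>\d+)|(?P<word>[A-Za-z]+)")
      kinds = [m.lastgroup for m in token.finditer("abc 42 de")]

   Here ``nested`` is ``1``, ``siblings`` is ``2`` and ``kinds`` is
   ``['word', 'number', 'word']``.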
 935
 936
 937.. attribute:: MatchObject.re
 938
 939   The regular expression object whose :meth:`match` or :meth:`search` method
 940   produced this :class:`MatchObject` instance.
 941
 942
 943.. attribute:: MatchObject.string
 944
 945   The string passed to :func:`match` or :func:`search`.
 946
 947
 948Examples
 949--------
 950
 951
 952Checking For a Pair
 953^^^^^^^^^^^^^^^^^^^
 954
 955In this example, we'll use the following helper function to display match
 956objects a little more gracefully:
 957
 958.. testcode::
 959
 960   def displaymatch(match):
 961       if match is None:
 962           return None
 963       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
 964
 965Suppose you are writing a poker program where a player's hand is represented as
 966a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "0" for 10, and "1" through "9"
 968representing the card with that value.
 969
 970To see if a given string is a valid hand, one could do the following:
 971
 972   >>> valid = re.compile(r"[0-9akqj]{5}$")
 973   >>> displaymatch(valid.match("ak05q"))  # Valid.
 974   "<Match: 'ak05q', groups=()>"
 975   >>> displaymatch(valid.match("ak05e"))  # Invalid.
 976   >>> displaymatch(valid.match("ak0"))    # Invalid.
 977   >>> displaymatch(valid.match("727ak"))  # Valid.
 978   "<Match: '727ak', groups=()>"
 979
 980That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
 981To match this with a regular expression, one could use backreferences as such:
 982
 983   >>> pair = re.compile(r".*(.).*\1")
 984   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
 985   "<Match: '717', groups=('7',)>"
 986   >>> displaymatch(pair.match("718ak"))     # No pairs.
 987   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
 988   "<Match: '354aa', groups=('a',)>"
 989
To find out what card the pair consists of, one could use the :meth:`group`
method of :class:`MatchObject` in the following manner:
 992
 993.. doctest::
 994
 995   >>> pair.match("717ak").group(1)
 996   '7'
 997
   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'
1004
1005   >>> pair.match("354aa").group(1)
1006   'a'
1007
1008
1009Simulating scanf()
1010^^^^^^^^^^^^^^^^^^
1011
1012.. index:: single: scanf()
1013
1014Python does not currently have an equivalent to :cfunc:`scanf`.  Regular
1015expressions are generally more powerful, though also more verbose, than
1016:cfunc:`scanf` format strings.  The table below offers some more-or-less
1017equivalent mappings between :cfunc:`scanf` format tokens and regular
1018expressions.
1019
+--------------------------------+---------------------------------------------+
| :cfunc:`scanf` Token           | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+
1041
1042To extract the filename and numbers from a string like ::
1043
1044   /usr/sbin/sendmail - 0 errors, 4 warnings
1045
1046you would use a :cfunc:`scanf` format like ::
1047
1048   %s - %d errors, %d warnings
1049
1050The equivalent regular expression would be ::
1051
1052   (\S+) - (\d+) errors, (\d+) warnings
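
As a sketch of how that expression might be applied (the variable names are
illustrative; converting the captured digits with ``int()`` is roughly what
``%d`` would do in C)::

   import re

   line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
   m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

   filename = m.group(1)
   # Groups are always strings, so numeric fields are converted explicitly.
   n_errors = int(m.group(2))
   n_warnings = int(m.group(3))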
1053
1054
1055Avoiding recursion
1056^^^^^^^^^^^^^^^^^^
1057
1058If you create regular expressions that require the engine to perform a lot of
1059recursion, you may encounter a :exc:`RuntimeError` exception with the message
``maximum recursion limit exceeded``.  For example, ::
1061
1062   >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
1063   >>> re.match('Begin (\w| )*? end', s).end()
1064   Traceback (most recent call last):
1065     File "<stdin>", line 1, in ?
1066     File "/usr/local/lib/python2.5/re.py", line 132, in match
1067       return _compile(pattern, flags).match(string)
1068   RuntimeError: maximum recursion limit exceeded
1069
1070You can often restructure your regular expression to avoid recursion.
1071
1072Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
1073avoid recursion.  Thus, the above regular expression can avoid recursion by
1074being recast as ``Begin [a-zA-Z0-9_ ]*?end``.  As a further benefit, such
1075regular expressions will run faster than their recursive equivalents.
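
The recast pattern can be checked against the same input.  This is only a
sketch; whether the original pattern actually raises depends on the Python
version in use::

   import re

   s = 'Begin ' + 1000*'a very long string ' + 'end'

   # Per the text above, a lazy repeat over a single character class is
   # special-cased, so this should match through to the final 'end'.
   m = re.match(r'Begin [a-zA-Z0-9_ ]*?end', s)
   matched_whole_string = (m is not None and m.end() == len(s))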
1076
1077
1078search() vs. match()
1079^^^^^^^^^^^^^^^^^^^^
1080
1081In a nutshell, :func:`match` only attempts to match a pattern at the beginning
of a string, whereas :func:`search` will match a pattern anywhere in a string.
1083For example:
1084
1085   >>> re.match("o", "dog")  # No match as "o" is not the first letter of "dog".
1086   >>> re.search("o", "dog") # Match as search() looks everywhere in the string.
1087   <_sre.SRE_Match object at ...>
1088
1089.. note::
1090
1091   The following applies only to regular expression objects like those created
1092   with ``re.compile("pattern")``, not the primitives ``re.match(pattern,
1093   string)`` or ``re.search(pattern, string)``.
1094
1095:func:`match` has an optional second parameter that gives an index in the string
1096where the search is to start::
1097
1098   >>> pattern = re.compile("o")
1099   >>> pattern.match("dog")      # No match as "o" is not at the start of "dog."
1100
1101   # Equivalent to the above expression as 0 is the default starting index:
1102   >>> pattern.match("dog", 0)
1103
1104   # Match as "o" is the 2nd character of "dog" (index 0 is the first):
1105   >>> pattern.match("dog", 1)
1106   <_sre.SRE_Match object at ...>
1107   >>> pattern.match("dog", 2)   # No match as "o" is not the 3rd character of "dog."
1108
1109
1110Making a Phonebook
1111^^^^^^^^^^^^^^^^^^
1112
1113:func:`split` splits a string into a list delimited by the passed pattern.  The
1114method is invaluable for converting textual data into data structures that can be
1115easily read and modified by Python as demonstrated in the following example that
1116creates a phonebook.
1117
First, here is the input.  Normally it may come from a file; here we are using
1119triple-quoted string syntax:
1120
   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
1136   >>> entries
1137   ['Ross McFluff: 834.345.1254 155 Elm Street',
1138   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
1139   'Frank Burger: 925.541.7625 662 South Dogwood Way',
1140   'Heather Albrecht: 548.326.4584 919 Park Place']
1141
1142Finally, split each entry into a list with first name, last name, telephone
1143number, and address.  We use the ``maxsplit`` parameter of :func:`split`
1144because the address has spaces, our splitting pattern, in it:
1145
1146.. doctest::
1147   :options: +NORMALIZE_WHITESPACE
1148
1149   >>> [re.split(":? ", entry, 3) for entry in entries]
1150   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
1151   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
1152   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
1153   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
1154
1155The ``:?`` pattern matches the colon after the last name, so that it does not
1156occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
1157house number from the street name:
1158
1159.. doctest::
1160   :options: +NORMALIZE_WHITESPACE
1161
1162   >>> [re.split(":? ", entry, 4) for entry in entries]
1163   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
1164   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
1165   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
1166   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
1167
1168
1169Text Munging
1170^^^^^^^^^^^^
1171
1172:func:`sub` replaces every occurrence of a pattern with a string or the
1173result of a function.  This example demonstrates using :func:`sub` with
1174a function to "munge" text, or randomize the order of all the characters
1175in each word of a sentence except for the first and last characters::
1176
   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
1186
1187
1188Finding all Adverbs
1189^^^^^^^^^^^^^^^^^^^
1190
1191:func:`findall` matches *all* occurrences of a pattern, not just the first
1192one as :f…
