/code/99_regex_reference.py

https://gitlab.com/varunkothamachu/DAT3
'''
Regular Expressions (regex) Reference Guide
Sources:
https://developers.google.com/edu/python/regular-expressions
https://docs.python.org/2/library/re.html
'''
r'''
Basic Patterns:
ordinary characters match themselves exactly
. matches any single character except newline \n
\w matches a word character (letter, digit, underscore)
\W matches any non-word character
\b matches boundary between word and non-word
\s matches single whitespace character (space, newline, return, tab, form feed)
\S matches single non-whitespace character
\d matches single digit (0 through 9)
\t matches tab
\n matches newline
\r matches return
\ escapes a special character so it matches literally, such as period: \.
Rules for Searching:
search proceeds through string from start to end, stopping at first match
all of the pattern must be matched
Basic Search Function:
match = re.search(r'pattern', string_to_search)
returns match object
if there is a match, access match using match.group()
if there is no match, match is None
use 'r' in front of pattern to designate a raw string (backslashes are kept as-is)
'''
import re
s = 'my 1st string!!'
match = re.search(r'st', s)
if match: match.group()  # 'st'
match = re.search(r'sta', s)
if match: match.group()  # no match: match is None
match = re.search(r'\w\w\w', s)
if match: match.group()  # '1st'
match = re.search(r'\W', s)
if match: match.group()  # ' '
match = re.search(r'\W\W', s)
if match: match.group()  # '!!'
match = re.search(r'\s', s)
if match: match.group()  # ' '
match = re.search(r'\s\s', s)
if match: match.group()  # no match: match is None
match = re.search(r'..t', s)
if match: match.group()  # '1st'
match = re.search(r'\s\St', s)
if match: match.group()  # ' st'
match = re.search(r'\bst', s)
if match: match.group()  # 'st' (from 'string'; there is no boundary inside '1st')
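The examples above do not exercise \d or the escaped period, so here is a minimal sketch of both. The sample strings and variable names are made up for illustration and are not part of the original guide.

```python
import re

s = 'version 3.14 released'  # hypothetical sample string

# \d matches one digit; \. matches a literal period
strict = re.search(r'\d\.\d\d', s)
strict_text = strict.group() if strict else None    # '3.14'

# an unescaped period matches any character except newline
loose = re.search(r'\d.\d\d', 'build 3x14')
loose_text = loose.group() if loose else None       # '3x14'

# the escaped pattern finds nothing here, so re.search returns None
missing = re.search(r'\d\.\d\d', 'build 3x14')
```

Checking the result against None before calling group() avoids an AttributeError when there is no match.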
'''
Positions:
^ match start of a string
$ match end of a string
'''
s = 'sid is missing class'
match = re.search(r'^miss', s)
if match: match.group()  # no match: match is None
match = re.search(r'..ss', s)
if match: match.group()  # 'miss'
match = re.search(r'..ss$', s)
if match: match.group()  # 'lass'
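Using ^ and $ together is a common way to require that the whole string fits a pattern. The sketch below assumes a hypothetical helper named is_all_digits, which is not part of the original guide.

```python
import re

def is_all_digits(text):
    """Hypothetical helper: True if the pattern ^\\d+$ matches text."""
    # ^ anchors the match at the start and $ at the end, so nothing else
    # may appear in between (note: $ also matches just before a single
    # trailing newline)
    return re.search(r'^\d+$', text) is not None

anchored_ok = is_all_digits('12345')               # True
anchored_bad = is_all_digits('123a5')              # False
unanchored = re.search(r'\d+', '123a5').group()    # '123'
```

Without the anchors, the same pattern is satisfied by any run of digits inside the string.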
'''
Repetition:
+ 1 or more occurrences of the pattern to its left
* 0 or more occurrences
? 0 or 1 occurrence
+ and * are 'greedy': they try to use up as much of the string as possible
add ? after + or * to make them non-greedy: +? or *?
'''
s = 'sid is missing class'
match = re.search(r'miss\w+', s)
if match: match.group()  # 'missing'
match = re.search(r'is\w+', s)
if match: match.group()  # 'issing'
match = re.search(r'is\w*', s)
if match: match.group()  # 'is'
s = '<h1>my heading</h1>'
match = re.search(r'<.+>', s)
if match: match.group()  # '<h1>my heading</h1>'
match = re.search(r'<.+?>', s)
if match: match.group()  # '<h1>'
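The ? quantifier is listed above but not demonstrated on its own, so a short sketch may help. The sample strings are invented for illustration.

```python
import re

# u? allows zero or one 'u', so one pattern covers both spellings
us = re.search(r'colou?r', 'the color red').group()     # 'color'
uk = re.search(r'colou?r', 'the colour red').group()    # 'colour'

# greedy + takes every digit; non-greedy +? stops after the first one
greedy = re.search(r'\d+', 'id 12345').group()           # '12345'
lazy = re.search(r'\d+?', 'id 12345').group()            # '1'
```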
r'''
Brackets:
[abc] match a or b or c
\w, \s, etc. work inside brackets, but a period inside brackets matches only a literal period
[a-z] match any lowercase letter (dash indicates a range unless it is first or last)
[abc-] match a or b or c or -
[^ab] match anything except a or b
'''
s = 'my email is john-doe@gmail.com'
match = re.search(r'\w+@\w+', s)
if match: match.group()  # 'doe@gmail'
match = re.search(r'[\w.-]+@[\w.-]+', s)
if match: match.group()  # 'john-doe@gmail.com'
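Ranges, negation, and the trailing dash are described above without their own examples. The following is a small sketch on a made-up sample string.

```python
import re

s = 'order A-42, shipped'   # hypothetical sample string

# [a-z] is a range: one or more lowercase letters
lower = re.search(r'[a-z]+', s).group()              # 'order'

# the dash is last in the brackets, so it is a literal dash
code = re.search(r'[A-Z0-9-]+', s).group()           # 'A-42'

# ^ as the first character in brackets negates the set
not_letters = re.search(r'[^a-zA-Z ]+', s).group()   # '-42,'

# inside brackets a period is literal, and s contains no period
dot_only = re.search(r'[.]', s)                      # None
```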
r'''
Lookarounds:
lookahead matches a pattern only if it is followed by another pattern
100(?= dollars) matches '100' only if it is followed by ' dollars'
lookbehind matches a pattern only if it is preceded by another pattern
(?<=\$)100 matches '100' only if it is preceded by '$'
the text inside the lookaround is checked but is not part of the match
'''
s = 'Name: Cindy, 30 years old'
match = re.search(r'\d+(?= years? old)', s)
if match: match.group()  # '30'
match = re.search(r'(?<=Name: )\w+', s)
if match: match.group()  # 'Cindy'
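The two patterns quoted in the notes above can be run directly. This sketch uses a made-up sample string that contains '100' twice, to show that each lookaround picks a different occurrence.

```python
import re

s = 'price: $100, or 100 dollars'   # hypothetical sample string

ahead = re.search(r'100(?= dollars)', s)
behind = re.search(r'(?<=\$)100', s)

ahead_text = ahead.group()     # '100'
behind_text = behind.group()   # '100'

# the surrounding text shows which occurrence each pattern matched
followed_by = s[ahead.end():]        # ' dollars'
preceded_by = s[:behind.start()]     # 'price: $'

# no match when the required context is missing
no_match = re.search(r'100(?= dollars)', '100 euros')   # None
```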
'''
Match Groups:
parentheses create logical groups inside of match text
match.group(1) corresponds to first group
match.group(2) corresponds to second group
match.group() corresponds to entire match text (as usual)
'''
s = 'my email is john-doe@gmail.com'
match = re.search(r'([\w.-]+)@([\w.-]+)', s)
if match:
    match.group(1)  # 'john-doe'
    match.group(2)  # 'gmail.com'
    match.group()   # 'john-doe@gmail.com'
'''
Finding All Matches:
re.findall() finds all non-overlapping matches and returns them as a list of strings
list_of_strings = re.findall(r'pattern', string_to_search)
if pattern includes one group in parentheses, a list of that group's strings is returned
if pattern includes two or more groups, a list of tuples is returned
'''
s = 'emails: joe@gmail.com, bob@gmail.com'
re.findall(r'[\w.-]+@[\w.-]+', s)      # ['joe@gmail.com', 'bob@gmail.com']
re.findall(r'([\w.-]+)@([\w.-]+)', s)  # [('joe', 'gmail.com'), ('bob', 'gmail.com')]
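What re.findall() returns depends on how many groups the pattern contains. The sketch below reuses the sample string from above. The variable names are illustrative only.

```python
import re

s = 'emails: joe@gmail.com, bob@gmail.com'

# exactly one group: a plain list of that group's text
users = re.findall(r'([\w.-]+)@[\w.-]+', s)       # ['joe', 'bob']

# two groups: a list of tuples, which unpack cleanly in a loop
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', s)
labels = [user + ' at ' + domain for user, domain in pairs]

# no matches: an empty list, never None
none_found = re.findall(r'\d+', s)                # []
```

Because an empty list is falsy, the result can be tested with a plain `if` before looping.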
'''
Option Flags:
option flags modify the behavior of the pattern matching
default: matching is case sensitive
re.IGNORECASE: ignore uppercase/lowercase differences ('a' matches 'a' or 'A')
default: period matches any character except newline
re.DOTALL: allow period to match newline
default: within a string of many lines, ^ and $ match start and end of entire string
re.MULTILINE: allow ^ and $ to match start and end of each line
option flag is third argument to re.search() or re.findall()
re.search(r'pattern', string_to_search, re.IGNORECASE)
re.findall(r'pattern', string_to_search, re.IGNORECASE)
'''
s = 'emails: nicole@ga.co, joe@gmail.com, PAT@GA.CO'
re.findall(r'\w+@ga\.co', s)                 # ['nicole@ga.co']
re.findall(r'\w+@ga\.co', s, re.IGNORECASE)  # ['nicole@ga.co', 'PAT@GA.CO']
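Only re.IGNORECASE is demonstrated above, so here is a minimal sketch of re.DOTALL and re.MULTILINE on a made-up two-line string. It also shows flags combined with the | operator, which the notes above do not mention.

```python
import re

s = 'first line\nSECOND line'   # hypothetical two-line sample string

# by default the period stops at the newline, so this finds nothing
no_dotall = re.search(r'first.+SECOND', s)                         # None
with_dotall = re.search(r'first.+SECOND', s, re.DOTALL).group()

# by default ^ matches only at the start of the whole string
starts_default = re.findall(r'^\w+', s)                   # ['first']
starts_multiline = re.findall(r'^\w+', s, re.MULTILINE)   # ['first', 'SECOND']

# flags can be combined with |
combined = re.findall(r'^s\w+', s, re.MULTILINE | re.IGNORECASE)   # ['SECOND']
```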
r'''
Substitution:
re.sub() finds all matches and replaces them with a specified string
new_string = re.sub(r'pattern', r'replacement', string_to_search)
replacement string can refer to text from matching groups:
\1 refers to group(1)
\2 refers to group(2)
etc.
'''
s = 'sid is missing class'
re.sub(r'is ', r'was ', s)  # 'sid was missing class'
s = 'emails: joe@gmail.com, bob@gmail.com'
re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@yahoo.com', s)  # 'emails: joe@yahoo.com, bob@yahoo.com'
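The example above uses only \1, so this sketch shows both group references in one replacement. The replacement format is invented for illustration.

```python
import re

s = 'emails: joe@gmail.com, bob@gmail.com'

# \2 is the domain and \1 is the user, so the two parts can be reordered
swapped = re.sub(r'([\w.-]+)@([\w.-]+)', r'\2 (user \1)', s)
# 'emails: gmail.com (user joe), gmail.com (user bob)'

# when the pattern never matches, the string comes back unchanged
unchanged = re.sub(r'\d+', r'#', s)
```

Writing the replacement as a raw string keeps \1 and \2 from being read as octal escapes.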
'''
Useful, But Not Covered:
re.split() splits a string by the occurrences of a pattern
re.compile() compiles a pattern (for improved performance if it's used many times)
A|B indicates a pattern that can match A or B
'''
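The three items listed above can each be sketched in a line or two. The sample strings and variable names below are assumptions for illustration, not part of the original guide.

```python
import re

# re.split(): split on a comma followed by optional whitespace
parts = re.split(r',\s*', 'joe, bob,ann')            # ['joe', 'bob', 'ann']

# re.compile(): build the pattern once, then reuse its methods
email_pattern = re.compile(r'[\w.-]+@[\w.-]+')
first = email_pattern.search('contact: joe@gmail.com').group()
all_found = email_pattern.findall('joe@gmail.com, bob@gmail.com')

# A|B: match either alternative
pets = re.findall(r'cat|dog', 'one cat, two dogs, three birds')   # ['cat', 'dog']
```

A compiled pattern offers search(), findall(), sub(), and split() as methods, so the pattern string does not have to be repeated at each call.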