New Programmers' Elucidation Series - 3: Regexes

The regular expression appears to be black magic to the newbie programmer and
that's because it is. In this episode, we will look at regular expressions first
in grep, then in Perl.

I will avoid going deep into regular expression syntax, because there are myriad
things on the internet that will tell you what you need to know about that. In
this post I am going to clear up some of the things I recall being confused
about where regexes were involved.
[h]Regexes in grep are yes or no[/h]

This was not explained to me when I first came across them. [b]Regexes give a
yes or no answer[/b]. When you are using [c]grep[/c] on the command line, the
regex is applied to each line in the file and [i]used as truth[/i].

It is only when you are using a language such as sed or Perl that regexes can do
things like tell you [i]how[/i] the "yes" answer came about. You need a
language that can actually save things in variables for later use. When you are
using a regex in grep, however, you are looking for [b]lines that match[/b].

Here is an example that shows the most basic use of a regex.
[shell]~$ grep -P '\d' code/podcats.in/cgi-bin/blosxom.cgi
depth => 0,
num_entries => 10,
show_future_entries => 0,
require_namespace => 1,
$path =~ s/\.(\w+)$// and $flavour = $1; [/shell]

The [c]-P[/c] makes grep use Perl's regex magic[fn]Regex magic is a term used to
define how the regular expression deals with special characters such as
[c]+[/c], [c]?[/c] and [c]*[/c]. In Perl, these things default to being special,
but in some other regex dialects, such as grep's default one, some of them need
a [c]\[/c] in front to do the same job.[/fn], including the [c]\d[/c] shortcut
for "a digit". Note that grep is given the pattern bare: the slashes you will
see around regexes in the rest of this post are Perl's delimiters, not part of
the pattern. Because this is grep, we return every line for which this regex
gives a "yes" response. The regex is asking for "a digit". Grep obliges by
showing us which lines have at least one digit in them.
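The same yes-or-no behaviour can be sketched in Perl, where a match used as a
condition is simply true or false. The sample lines here are made up for
illustration; they are not read from blosxom.cgi.

[perl]use strict;
use warnings;

# Made-up sample lines, standing in for the lines of a file.
my @lines = (
    'depth => 0,',
    'my $flavour;',
    'num_entries => 10,',
    '# no digits here',
);

# Perl's built-in grep keeps each element for which the regex says "yes".
my @with_digit = grep { /\d/ } @lines;

print "$_\n" for @with_digit;
[/perl]

Only the two lines containing a digit are printed; the regex never says where
the digit was, only that there was one.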
[h]Why we use regexes[/h]

The previous example, without a regex, would require concatenating the output of
ten greps (one for each digit), and even with that the output would not be
identical. The regular expression concept allows us to express a [i]pattern[/i]
for our searches. The 'digit' pattern is a simple example.

The important part about a pattern is that you can specify a potentially
infinite set of matches with a single expression. The expression [c]/./[/c], for
example, will match every single line except the empty one. Its real-world
potential is limited: the only thing it would [i]not[/i] match is the empty
string, and you can get the same results more easily by searching for the empty
string and inverting the match to find the lines that [i]don't[/i]
match[fn][c]grep -v '^$'[/c][/fn]. Still, the fact that this expression can
match almost the entire set of possible strings should show how powerful
patterns can be.
[h]The expression will match anywhere in the string[/h]

One of the scariest parts of regexes is their complexity. This is often born of
an insufficient understanding of them. In many cases, a complex regular
expression can be simplified purely because some of it does not have to be
there. A common example of this is that people believe you have to specify the
entire string. The grep example above should have served to refute this already:
the pattern in that example will be tested on all parts of the string, as the
output shows.

A common example of [i]this[/i] is the excessive use of [c].*[/c]. People use
this to say "anything", which, in fairness, it does mean. The thing is, you only
need to say "anything" if it is between "specific things": [c]/^.*abc.*$/[/c] is
better written as [c]/abc/[/c], but [c]/^a.*b$/[/c] cannot be written another
way, except with two regexes.
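To check that the extra [c].*[/c] really buys nothing, here is a small sketch
that runs both forms over a few invented single-line strings and compares what
they let through.

[perl]use strict;
use warnings;

# Invented sample strings.
my @samples = ('abc', 'xxabcxx', 'ab', 'a then b', 'cab');

# The long-winded form and the short form of the same question.
my @long  = grep { /^.*abc.*$/ } @samples;
my @short = grep { /abc/ } @samples;

print "long:  @long\n";
print "short: @short\n";
[/perl]

Both lists come out the same, because "anything, then abc, then anything" is
just "abc, somewhere".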
[h]Be As Vague As Possible, But No Vaguer[/h]

When dealing with regular expressions it is necessary to employ logic, because
it is programming and programming involves logic. The logic we employ here is
the type that gets us from a few examples of what we're looking for to a regular
expression that matches what we want. Sometimes you may in fact require two or
more regular expressions, and there's no harm in running the same string through
several in order to determine whether it survives your scrutiny or not.
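As a sketch of the "several regexes" approach, the requirement "starts with a
and ends with b" can be split into two small tests. The helper name
[c]starts_a_ends_b[/c] is made up for this example.

[perl]use strict;
use warnings;

# Hypothetical helper: true if the string starts with "a" and ends with "b".
# Each requirement gets its own small regex.
sub starts_a_ends_b {
    my ($s) = @_;
    return ($s =~ /^a/ && $s =~ /b$/) ? 1 : 0;
}

for my $s ('ab', 'a then b', 'abc', 'cab') {
    printf "%-10s %s\n", $s, starts_a_ends_b($s) ? 'yes' : 'no';
}
[/perl]

Each regex is trivial to read on its own, and the string has to survive both.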

When trying to find the regex that suitably matches your data you should
consider other data that should not match. It may be the case that you need to
add to (or even remove some of) your pattern in order to avoid false positives
or false negatives. One generally reads regexes by saying "and then a" a lot.
This is because if you read them out to yourself like this and check your sample
data along the way, you will spot where what you have just said does not ring
true.
[h]False positives[/h]

False positives are more common than false negatives because it is easy to use a
regex to match against a broad set of characters when you really need to match a
much narrower set. Plus, it is not usually possible for the regex itself to test
one part of a match against another, when what is allowed in one place depends
on what was found elsewhere in the regex. This leads to broader regex
definitions than are strictly necessary, but it is usually offset by more logic
in the language itself, checking the results.

Let's show a real-world example. We have a set of output files of processes that
are running, and we can determine whether they have finished or not by looking
for a date stamp at the end of each file.

We know that a date is four digits, then two, then two, each group separated by
a dash. So we can easily express this as a regex because we have a term that
means 'a digit':
[c]/\d\d\d\d-\d\d-\d\d/[/c]

Probably better written as

[c]/\d{4}-\d{2}-\d{2}/[/c]

The question is, what else will it match? For a start, there is no validation,
so even if it does accurately find a date, that date may be in the future, which
is still wrong for our purposes.

What if the log is also logging user input, and the user input contains a date
in this format? Or it is simply timestamping each log entry? What if the log
outputs a string like [c]000000-00-00000[/c]?

Remember that the regex matches [i]anywhere[/i] in the string. Another common
mistake is to believe that, for example, if 4 digits are requested, 5 digits
will not match. This is a fallacy: if you ask for 4 digits, 5 digits will match
for the simple reason that it has four digits in it.

A better way to write this regex would involve [i]anchors[/i]. Anchors are
tokens that refer to parts of the string that always exist, and always in the
same place. Specifically, the start and end of the string.

[c]/^\d{4}-\d{2}-\d{2}$/[/c]

How have we solved our false matches? Well, if the date appears alongside
[i]anything else[/i] in the string, it will not match, because the start of the
string is not found immediately before the date, or the end of the string is not
found immediately after the date. That solves the timestamp problem and the user
input problems. The future date problem has not been solved, and really can't be
solved without a language that can compare the date to today's date.
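Here is a quick sketch of the difference the anchors make. The sample strings
are invented, apart from the all-zeroes one mentioned earlier.

[perl]use strict;
use warnings;

# Invented sample lines, including the awkward cases.
my @samples = (
    '2010-10-17',
    'started at 2010-10-17 by cron',
    '000000-00-00000',
    '12345-10-17',
);

# Without anchors the pattern may match anywhere in the string.
my @loose  = grep { /\d{4}-\d{2}-\d{2}/ } @samples;

# With anchors the date has to be the whole string.
my @strict = grep { /^\d{4}-\d{2}-\d{2}$/ } @samples;

print "unanchored matches: ", scalar(@loose), "\n";
print "anchored matches:   ", scalar(@strict), "\n";
[/perl]

The unanchored regex says yes to all four strings; the anchored one only accepts
the line that is a date and nothing else.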
[h]False negatives[/h]

Validating user input is a place where false negatives come into play. False
negatives tend to mean you shouldn't've used a regex in the first place. The
most common real-world examples of where you should not use a regex are HTML and
email addresses.

Many people will try to validate an email address like this:

[c]/[^@]+@[^.]+\.[^.]+(\.[^.]+)*/[/c]

That means, first, at least one character ([c]+[/c]) that is not an @
([c][^@][/c]). Then the @. Then at least one character that is not a dot
([c][^.]+[/c]), then an actual dot (the [c].[/c] must be escaped with [c]\[/c]
because in a regex it means "any character" otherwise), then at least one more
character that is not a dot, and then the dot-then-not-a-dot sequence zero or
more times ([c]*[/c]). The parentheses group the dot-then-not-a-dot sequence
together so the asterisk applies to the whole lot, rather than just the previous
character class.

This works fine for many email addresses, including ones with a plus in them,
which is becoming more and more common these days because Google Mail uses it
for a feature. But it doesn't take into account the fact that email addresses
can contain the @ symbol if it is quoted, and can't contain certain other
symbols unless they are quoted too. Nor does it care that anything can come
after the @, for example "localhost", without requiring a dot in it at all.
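A short sketch shows both kinds of mistake. The addresses are invented, and the
pattern is anchored with [c]^[/c] and [c]$[/c] here on the assumption that a
validator would want to account for the whole string.

[perl]use strict;
use warnings;

# The naive pattern, anchored so that it has to cover the whole string.
my $naive = qr/^[^@]+@[^.]+\.[^.]+(\.[^.]+)*$/;

# Invented addresses.
my @addresses = (
    'someone+tag@example.com',   # fine, plus sign and all
    'someone@localhost',         # fine, but has no dot after the @
    'a@b@example.com',           # not fine: unquoted second @
);

for my $address (@addresses) {
    printf "%-26s %s\n", $address, $address =~ $naive ? 'accepted' : 'rejected';
}
[/perl]

The second address is rejected although it is acceptable (a false negative), and
the third is accepted although it is not (a false positive), because [c][^.][/c]
is quite happy to match an @.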

The list of problems with this regex is extensive, because the RFC on email
addresses is nightmarish.
[url=http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html]Here[/url] is the
regex you need in order to validate an email address correctly. If you don't use
it, your email address regex will certainly come up with false negatives[fn]At a
previous job, we had an issue where we took details from another company. One
woman's email address was, fairly, her name, but her name had an apostrophe in
it, being French or so. Anyway, the email validator would not accept the
apostrophe, but the checks were bypassed when we put the data into the database,
meaning that she had an account but couldn't log in because the checks were run
to validate user input to the login form. A definite false negative.[/fn].
[h]Getting data back[/h]

When you are not using grep but using something like sed or, gasp, Perl, you can
use parentheses to find out what the actual data were in the first place. We
have already seen how a regex matches anywhere in a line; with capturing, you
can say "Which bit?". Let's take the logfile date example again.

[c]/^\d{4}-\d{2}-\d{2}$/[/c]

Recall that this matched a date, on a line, on its own; hence the anchors. But
it didn't test that the date was not in the future. With the addition of two
parentheses we can turn this into a regex that not only says yes and no, but
tells us how it came up with the yes.

[c]/^(\d{4}-\d{2}-\d{2})$/[/c]

You may be starting to see why regexes can be so complex. Every character has a
cryptic meaning! This is just something you have to learn over time; they try to
be consistent and logical about it where possible though. Anywho, when we run
this regex in Perl against a string we will find it populates the variable
[c]$1[/c] with the date that it matched.
[perl]use POSIX qw(strftime);

# $logfile is assumed to be a filehandle that is already open for reading
while (my $str = <$logfile>) {
    if ($str =~ /^(\d{4}-\d{2}-\d{2})$/) {
        my $date  = $1;
        my $today = strftime "%Y-%m-%d", localtime;
        # dates in this format sort correctly as plain strings
        if ($date le $today) {
            # hooray! date is today or in the past
        }
    }
}
[/perl]

If it didn't match, of course, [c]$1[/c] will keep the previous value it had,
which starts off as undef, and the [c]if[/c] block will not run.

Using multiple sets of parentheses will set other values; [c]$2[/c], [c]$3[/c]
etc. In fact these are available in the regex itself as [c]\1[/c] [c]\2[/c] etc;
you can use them to test whether the same thing appears twice.

Here is a naive way of finding quoted strings:

[c]/((['"]).+?\2)/[/c]

Groups are numbered by their opening parenthesis, so [c]$1[/c] will contain the
whole string, quotes included, and [c]$2[/c] will contain the delimiter
([c]'[/c] or [c]"[/c]). That is why the regex refers back to the delimiter as
[c]\2[/c]. The problem with this, of course, is it ignores the fact that you can
escape the delimiter with [c]\"[/c]. In the real world you should not use this
as a way of finding quoted strings; otherwise there would not be a module for
it; but it shows the use of the [b]backreference[/b].
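As a sketch, here is a backreference at work on an invented line of code. The
pattern has an outer group around the whole quoted string and an inner group
around the opening delimiter, so the delimiter is group 2 and is referred back
to as [c]\2[/c].

[perl]use strict;
use warnings;

# Invented input with one double-quoted and one single-quoted string.
my $code = q{print "it's here", 'done';};

# Group 1 is the whole quoted string; group 2 is the opening delimiter,
# which \2 demands again at the end.
my ($whole, $delim) = $code =~ /((['"]).+?\2)/;

print "delimiter: $delim\n";
print "string:    $whole\n";
[/perl]

The apostrophe in "it's" does not end the match, because the backreference
insists on the same delimiter that opened the string.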
[h]Finding many matches[/h]

The [c]/g[/c] modifier causes the match to be applied globally, which means if
it matches once it'll try again, picking up from where it left off. This is
useful if, like in the previous example, you may have many instances of the same
thing in the string.

Someone in #perl recently asked "How can I find the number of occurrences of a
space in a string?". This is a job for the [c]/g[/c] modifier and list
context. You search for a space with [c]/\s/g[/c] (strictly, [c]\s[/c] is any
whitespace; use [c]/ /g[/c] if you only mean the space character), and use list
context to find all occurrences. Then you count the list! Easy.

See, if you just use a regex you will get a yes-or-no answer, but if you ask for
a list, you will get them all. In fact, Perl's behaviour with [c]/g[/c] in
scalar context is that if you run the same regex [i]again[/i], against the same
variable, you will get the next match in [c]$1[/c].
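Both behaviours can be sketched on an invented sample string: counting the
matches via list context, and stepping through them one at a time in scalar
context.

[perl]use strict;
use warnings;

my $str = "the quick brown fox";   # invented sample with three spaces

# List context returns every match; assigning that list to the empty list
# and then to a scalar gives the number of matches.
my $count = () = $str =~ /\s/g;
print "whitespace characters: $count\n";

# Scalar context: each evaluation picks up where the previous match ended.
my @positions;
while ($str =~ /\s/g) {
    push @positions, pos($str);   # offset just after each match
}
print "matches ended at: @positions\n";
[/perl]

The [c]pos[/c] function reports where the last match on that variable left off,
which is how Perl knows where to resume.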

Let's see this in action.

[perl]my $str = <$file>;
while ($str =~ /\s(\d{4}-\d{2}-\d{2})\s/g) {
    # $1 now contains the next date found in $str
}
[/perl]

We have revisited our naive date finder. This time, instead of using the
[c]^[/c] and [c]$[/c] anchors, we are using [c]\s[/c]. This means "whitespace"
and basically ensures that there is whitespace to each side of the date[fn]The
whitespace token will not match the start or end of string, so there is a bug
waiting to happen right there.[/fn]. Note that the parentheses do not capture
the whitespace, but the regex still requires it in order to match.

Alternatively to the above we can find all the dates at once by putting it in
list context:

[perl]my $str = <$file>;
if (my @matches = $str =~ /\s(\d{4}-\d{2}-\d{2})\s/g) {
    # @matches now contains all dates in the string.
}
[/perl]

Now we can loop over [c]@matches[/c] and use each entry in the same way we used
[c]$1[/c] in the while loop.
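The bug that the footnote warns about is easy to provoke. In this sketch, with
an invented line, the first date sits at the very start of the string and so has
no whitespace in front of it.

[perl]use strict;
use warnings;

# Invented line: the first date is at the very start of the string.
my $str = "2010-10-17 started, 2010-10-18 finished\n";

my @matches = $str =~ /\s(\d{4}-\d{2}-\d{2})\s/g;

# Only the second date has whitespace on both sides, so only it is found.
print "$_\n" for @matches;
[/perl]

One date goes missing without any error, which is exactly the sort of thing to
test your sample data for.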
[h]Conclusion[/h]

We have learned that regexes:

[ul]
[li]Match [b]anywhere in a string[/b][/li]
[li]Are most simply used to determine a yes or no answer[/li]
[li]Can easily be too broad or too narrow[/li]
[li]Probably require a fair amount of experience to know all their
nuances[/li]
[li]Can match the same string multiple times[/li]
[li]Can tell us not only that they found a match, but what it was[/li]
[/ul]

Next time you see a regular expression, or when you first start using them,
knowing these few points should help you express what you want. Remember to try
to reword the requirement of the regex into something you can express simply, or
into several separate requirements if necessary.