/doc/index.html

http://github.com/edicl/cl-ppcre
  1. <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
  2. <html>
  3. <head>
  4. <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  5. <title>CL-PPCRE - Portable Perl-compatible regular expressions for Common Lisp</title>
  6. <style type="text/css">
  7. pre { padding:5px; background-color:#e0e0e0 }
  8. h3, h4 { text-decoration: underline; }
  9. a { text-decoration: none; padding: 1px 2px 1px 2px; }
  10. a:visited { text-decoration: none; padding: 1px 2px 1px 2px; }
  11. a:hover { text-decoration: none; padding: 1px 1px 1px 1px; border: 1px solid #000000; }
  12. a:focus { text-decoration: none; padding: 1px 2px 1px 2px; border: none; }
  13. a.none { text-decoration: none; padding: 0; }
  14. a.none:visited { text-decoration: none; padding: 0; }
  15. a.none:hover { text-decoration: none; border: none; padding: 0; }
  16. a.none:focus { text-decoration: none; border: none; padding: 0; }
  17. a.noborder { text-decoration: none; padding: 0; }
  18. a.noborder:visited { text-decoration: none; padding: 0; }
  19. a.noborder:hover { text-decoration: none; border: none; padding: 0; }
  20. a.noborder:focus { text-decoration: none; border: none; padding: 0; }
  21. pre.none { padding:5px; background-color:#ffffff }
  22. </style>
  23. <meta name="description" content="Fast and portable perl-compatible regular expressions for Common Lisp.">
  24. </head>
  25. <body bgcolor=white>
  26. <h2>CL-PPCRE - Portable Perl-compatible regular expressions for Common Lisp</h2>
  27. <blockquote>
  28. <br>&nbsp;<br><h3>Abstract</h3>
  29. CL-PPCRE is a portable regular expression library for Common Lisp
  30. which has the following features:
  31. <ul>
  32. <li>It is <b>compatible with Perl</b>.
  33. <li>It is pretty <b>fast</b>.
  34. <li>It is <b>portable</b> between ANSI-compliant Common Lisp
  35. implementations.
  36. <li>It is <b>thread-safe</b>.
  37. <li>In addition to specifying regular expressions as strings like in
  38. Perl you can also use <a
  39. href="#create-scanner2"><b>S-expressions</b></a>.
  40. <li>It comes with a <a
  41. href="http://www.opensource.org/licenses/bsd-license.php"><b>BSD-style
  42. license</b></a> so you can basically do with it whatever you want.
  43. </ul>
  44. CL-PPCRE has been used successfully in various applications like <a
  45. href="http://nostoc.stanford.edu/Docs/">BioBike</a>,
  46. <a href="http://clutu.com/">clutu</a>,
  47. <a
  48. href="http://www.hpc.unm.edu/~download/LoGS/">LoGS</a>, <a href="http://cafespot.net/">CafeSpot</a>, <a href="http://www.eboy.com/">Eboy</a>, or <a
  49. href="http://weitz.de/regex-coach/">The Regex Coach</a>.
  50. <p>
  51. <font color=red>Download shortcut:</font> <a href="http://weitz.de/files/cl-ppcre.tar.gz">http://weitz.de/files/cl-ppcre.tar.gz</a>.
  52. </blockquote>
  53. <br>&nbsp;<br><h3><a class=none name="contents">Contents</a></h3>
  54. <ol>
  55. <li><a href="#install">Download and installation</a>
  56. <li><a href="#support">Support</a>
  57. <li><a href="#dict">The CL-PPCRE dictionary</a>
  58. <ol>
  59. <li><a href="#scanning">Scanning</a>
  60. <ol>
  61. <li><a href="#create-scanner"><code>create-scanner</code></a> (for Perl regex strings)
  62. <li><a href="#create-scanner2"><code>create-scanner</code></a> (for parse trees)
  63. <li><a href="#scan"><code>scan</code></a>
  64. <li><a href="#scan-to-strings"><code>scan-to-strings</code></a>
  65. <li><a href="#register-groups-bind"><code>register-groups-bind</code></a>
  66. <li><a href="#do-scans"><code>do-scans</code></a>
  67. <li><a href="#do-matches"><code>do-matches</code></a>
  68. <li><a href="#do-matches-as-strings"><code>do-matches-as-strings</code></a>
  69. <li><a href="#do-register-groups"><code>do-register-groups</code></a>
  70. <li><a href="#all-matches"><code>all-matches</code></a>
  71. <li><a href="#all-matches-as-strings"><code>all-matches-as-strings</code></a>
  72. </ol>
  73. <li><a href="#splitting">Splitting and replacing</a>
  74. <ol>
  75. <li><a href="#split"><code>split</code></a>
  76. <li><a href="#regex-replace"><code>regex-replace</code></a>
  77. <li><a href="#regex-replace-all"><code>regex-replace-all</code></a>
  78. </ol>
  79. <li><a href="#modify">Modifying scanner behaviour</a>
  80. <ol>
  81. <li><a href="#*property-resolver*"><code>*property-resolver*</code></a>
  82. <li><a href="#parse-tree-synonym"><code>parse-tree-synonym</code></a>
  83. <li><a href="#define-parse-tree-synonym"><code>define-parse-tree-synonym</code></a>
  84. <li><a href="#*regex-char-code-limit*"><code>*regex-char-code-limit*</code></a>
  85. <li><a href="#*use-bmh-matchers*"><code>*use-bmh-matchers*</code></a>
  86. <li><a href="#*optimize-char-classes*"><code>*optimize-char-classes*</code></a>
  87. <li><a href="#*allow-quoting*"><code>*allow-quoting*</code></a>
  88. <li><a href="#*allow-named-registers*"><code>*allow-named-registers*</code></a>
  89. </ol>
  90. <li><a href="#misc">Miscellaneous</a>
  91. <ol>
  92. <li><a href="#parse-string"><code>parse-string</code></a>
  93. <li><a href="#create-optimized-test-function"><code>create-optimized-test-function</code></a>
  94. <li><a href="#quote-meta-chars"><code>quote-meta-chars</code></a>
  95. <li><a href="#regex-apropos"><code>regex-apropos</code></a>
  96. <li><a href="#regex-apropos-list"><code>regex-apropos-list</code></a>
  97. </ol>
  98. <li><a href="#conditions">Conditions</a>
  99. <ol>
  100. <li><a href="#ppcre-error"><code>ppcre-error</code></a>
  101. <li><a href="#ppcre-invocation-error"><code>ppcre-invocation-error</code></a>
  102. <li><a href="#ppcre-syntax-error"><code>ppcre-syntax-error</code></a>
  103. <li><a href="#ppcre-syntax-error-string"><code>ppcre-syntax-error-string</code></a>
  104. <li><a href="#ppcre-syntax-error-pos"><code>ppcre-syntax-error-pos</code></a>
  105. </ol>
  106. </ol>
  107. <li><a href="#unicode">Unicode properties</a>
  108. <ol>
  109. <li><a href="#unicode-property-resolver"><code>unicode-property-resolver</code></a>
  110. </ol>
  111. <li><a href="#filters">Filters</a>
  112. <li><a href="#perl">Compatibility with Perl</a>
  113. <ol>
  114. <li><a href="#empty">Empty strings instead of <code>undef</code> in <code>$1</code>, <code>$2</code>, etc.</a>
  115. <li><a href="#scope">Strange scoping of embedded modifiers</a>
  116. <li><a href="#inconsistent">Inconsistent capturing of <code>$1</code>, <code>$2</code>, etc.</a>
  117. <li><a href="#lookaround">Captured groups not available outside of look-aheads and look-behinds</a>
  118. <li><a href="#order">Alternations don't always work from left to right</a>
  119. <li><a href="#uprops">Different names for Unicode properties</a>
  120. <li><a href="#mac"><code>&quot;\r&quot;</code> doesn't work with MCL</a>
  121. <li><a href="#alpha">What about <code>&quot;\w&quot;</code>?</a>
  122. </ol>
  123. <li><a href="#bugs">Bugs and problems</a>
  124. <ol>
  125. <li><a href="#quote"><code>&quot;\Q&quot;</code> doesn't work, or does it?</a>
  126. <li><a href="#backslash">Backslashes may confuse you...</a>
  127. </ol>
  128. <li><a href="#allegro">AllegroCL compatibility mode</a>
  129. <li><a href="#blabla">Hints, comments, performance considerations</a>
  130. <li><a href="#ack">Acknowledgements</a>
  131. </ol>
  132. <br>&nbsp;<br><h3><a name="install" class=none>Download and installation</a></h3>
  133. CL-PPCRE together with this documentation can be downloaded from <a
  134. href="http://weitz.de/files/cl-ppcre.tar.gz">http://weitz.de/files/cl-ppcre.tar.gz</a>. The
  135. current version is 2.0.11.
  136. <p>
  137. CL-PPCRE comes with a system definition
  138. for <a href="http://www.cliki.net/asdf">ASDF</a> and you compile and
  139. load it in the usual way. There are no dependencies (except that the
  140. <a href="#test">test suite</a> which is not needed for normal operation depends
  141. on <a href="http://weitz.de/flexi-streams/">FLEXI-STREAMS</a>).
  142. <p>
  143. The preferred way to install CL-PPCRE is
  144. through <a href="http://www.quicklisp.org/" target="_new">Quicklisp</a>:
  145. <pre>(ql:quickload :cl-ppcre)</pre>
  146. </p>
  147. <p>
  148. <a class=none name="test">You</a> can run a test suite which tests most aspects of the library with
  149. <pre>
  150. (asdf:oos 'asdf:test-op :cl-ppcre)
  151. </pre>
  152. <p>
  153. The current development version of CL-PPCRE can be found
  154. at <a href="https://github.com/edicl/cl-ppcre">https://github.com/edicl/cl-ppcre</a>. If you want to send patches, please fork the github repository and send pull requests.
  155. <p>
  156. <br>&nbsp;<br><h3><a name="support" class=none>Support</a></h3>
  157. The development version of cl-ppcre can be
  158. found <a href="https://github.com/edicl/cl-ppcre" target="_new">on
  159. github</a>. Please use the github issue tracking system to submit bug
  160. reports. Patches are welcome; please
  161. use <a href="https://github.com/edicl/cl-ppcre/pulls">GitHub pull
  162. requests</a>. If you want to make a change,
  163. please <a href="http://weitz.de/patches.html" target="_new">read this
  164. first</a>.
  165. <br>&nbsp;<br><h3><a class=none name="dict">The CL-PPCRE dictionary</a></h3>
  166. <h4><a name="scanning" class=none>Scanning</a></h4>
  167. <p><br>[Method]
  168. <br><a class=none name="create-scanner"><b>create-scanner</b> <i>(string string)<tt>&amp;key</tt> case-insensitive-mode multi-line-mode single-line-mode extended-mode destructive</i> =&gt; <i>scanner, register-names</i></a>
  169. <blockquote><br> Accepts a string which is a regular expression in
  170. Perl syntax and returns a closure which will scan strings for this
  171. regular expression. The second value is only returned if <a href="#*allow-named-registers*"><code>*ALLOW-NAMED-REGISTERS*</code></a> is <i>true</i>. It is a list of strings mapping registers to their respective names: the first element stands for the first register, the second element for the second register, etc. You have to store this value if you want to map a register number to its name later, as <i>scanner</i> doesn't capture any information about register names. If a register isn't named, it has NIL as its name.
  172. <p>
  173. The mode keyword arguments are equivalent to the
  174. <code>&quot;imsx&quot;</code> modifiers in Perl. The
  175. <code>destructive</code> keyword will be ignored.
  176. <p>
  177. The function accepts most of the regex syntax of Perl 5.8 as described
  178. in <a href="http://perldoc.perl.org/5.8.8/perlre.html"><code>man
  179. perlre</code></a> including extended features like non-greedy
  180. repetitions, positive and negative look-ahead and look-behind
  181. assertions, &quot;standalone&quot; subexpressions, and conditional
  182. subpatterns. The following Perl features are (currently) <b>not</b>
  183. supported:
  184. <ul>
  185. <li><code>(?{ code })</code> and <code>(??{ code })</code> because
  186. they obviously don't make sense in Lisp.
  187. <li><code>\N{name}</code> (named characters), <code>\x{263a}</code>
  188. (wide hex characters), <code>\l</code>, <code>\u</code>,
  189. <code>\L</code>, and <code>\U</code>
  190. because they're actually not part of Perl's <em>regex</em> syntax - but see <a href="http://weitz.de/cl-interpol/">CL-INTERPOL</a>.
  191. <li><code>\X</code> (extended Unicode), and <code>\C</code> (single
  192. character). But you can of course use all characters
  193. supported by your CL implementation.
  194. <li>Posix character classes like <code>[[:alpha:]]</code>.
  195. Use <a href="#unicode">Unicode properties</a> instead.
  196. <li><code>\G</code> for Perl's <code>pos()</code> because we don't have it.
  197. </ul>
  198. Note, however, that <code>\t</code>, <code>\n</code>, <code>\r</code>,
  199. <code>\f</code>, <code>\a</code>, <code>\e</code>, <code>\033</code>
  200. (octal character codes), <code>\x1B</code> (hexadecimal character
  201. codes), <code>\c[</code> (control characters), <code>\w</code>,
  202. <code>\W</code>, <code>\s</code>, <code>\S</code>, <code>\d</code>,
  203. <code>\D</code>, <code>\b</code>, <code>\B</code>, <code>\A</code>,
  204. <code>\Z</code>, and <code>\z</code> <b>are</b> supported.
  205. <p>
  206. Since version 0.6.0, CL-PPCRE also supports Perl's <code>\Q</code> and <code>\E</code> - see <a
  207. href="#*allow-quoting*"><code>*ALLOW-QUOTING*</code></a> below. Make sure you also read <a href="#quote">the relevant section</a> in &quot;<a href="#bugs">Bugs and problems</a>.&quot;
  208. <p>
  209. Since version 1.3.0, CL-PPCRE offers support for <a href="http://www.franz.com/support/documentation/7.0/doc/regexp.htm#regexp-new-capturing-2">AllegroCL's</a> <code>(?&lt;name&gt;"&lt;regex&gt;")</code> named registers and <code>\k&lt;name&gt;</code> back-reference syntax; have a look at <a href="#*allow-named-registers*"><code>*ALLOW-NAMED-REGISTERS*</code></a> for details.
  210. <p>
  211. Since version 2.0.0, CL-PPCRE
  212. supports <a href="#*property-resolver*">named properties</a>
  213. (<code>\p</code> and <code>\P</code>), but only the long form with
  214. braces is supported, i.e. <code>\p{Letter}</code>
  215. and <code>\p{L}</code> will work while <code>\pL</code> won't.
  216. <p>
  217. The keyword arguments are just for your
  218. convenience. You can always use embedded modifiers like
  219. <code>&quot;(?i-s)&quot;</code> instead.</blockquote>
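<p>
The equivalence between the mode keyword arguments and embedded modifiers can be illustrated with a minimal sketch. It assumes CL-PPCRE has already been loaded (for example through Quicklisp); the variable names and the sample strings are made up for illustration.

```lisp
;; Assumes CL-PPCRE is already loaded, e.g. via (ql:quickload :cl-ppcre).
;; BY-KEYWORD and BY-MODIFIER are illustrative names, not part of the library.
(let ((by-keyword  (cl-ppcre:create-scanner "abc" :case-insensitive-mode t))
      (by-modifier (cl-ppcre:create-scanner "(?i)abc")))
  ;; Collect the four return values of SCAN for each scanner so that
  ;; they can be compared side by side.
  (list (multiple-value-list (cl-ppcre:scan by-keyword "xxABCxx"))
        (multiple-value-list (cl-ppcre:scan by-modifier "xxABCxx"))))
```

Both sublists should be the same, since the two scanners describe the same case-insensitive regular expression.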
  220. <p><br>[Method]
  221. <br><a class=none name="create-scanner"><b>create-scanner</b> <i>(function function)<tt>&amp;key</tt> case-insensitive-mode multi-line-mode single-line-mode extended-mode destructive</i> =&gt; <i>scanner</i></a>
  222. <blockquote><br> In this case <code><i>function</i></code> should be a
  223. scanner returned by another invocation
  224. of <code>CREATE-SCANNER</code>. It will be returned as is. You can't
  225. use any of the keyword arguments because the scanner has already been
  226. created and is immutable.
  227. </blockquote>
  228. <p><br>[Method]
  229. <br><a class=none name="create-scanner2"><b>create-scanner</b> <i>(parse-tree t)<tt>&amp;key</tt> case-insensitive-mode multi-line-mode single-line-mode extended-mode destructive</i> =&gt; <i>scanner, register-names</i></a>
  230. <blockquote><br>
  231. This is similar to <a
  232. href="#create-scanner"><code>CREATE-SCANNER</code></a> for regex strings above but
  233. accepts a <em>parse tree</em> as its first argument. A parse tree is an S-expression
  234. conforming to the following syntax:
  235. <ul>
  236. <li>Every string and character is a parse tree and is treated
  237. <em>literally</em> as a part of the regular expression,
  238. i.e. parentheses, brackets, asterisks and such aren't special.
  239. <li>The symbol <code>:VOID</code> is equivalent to the empty string.
  240. <li>The symbol <code>:EVERYTHING</code> is equivalent to Perl's dot,
  241. i.e. it matches everything (except maybe a newline character depending
  242. on the mode).
  243. <li>The symbols <code>:WORD-BOUNDARY</code> and
  244. <code>:NON-WORD-BOUNDARY</code> are equivalent to Perl's
  245. <code>&quot;\b&quot;</code> and <code>&quot;\B&quot;</code>.
  246. <li>The symbols <code>:DIGIT-CLASS</code>,
  247. <code>:NON-DIGIT-CLASS</code>, <code>:WORD-CHAR-CLASS</code>,
  248. <code>:NON-WORD-CHAR-CLASS</code>,
  249. <code>:WHITESPACE-CHAR-CLASS</code>, and
  250. <code>:NON-WHITESPACE-CHAR-CLASS</code> are equivalent to Perl's
  251. <em>special character classes</em> <code>&quot;\d&quot;</code>,
  252. <code>&quot;\D&quot;</code>, <code>&quot;\w&quot;</code>,
  253. <code>&quot;\W&quot;</code>, <code>&quot;\s&quot;</code>, and
  254. <code>&quot;\S&quot;</code> respectively.
  255. <li>The symbols <code>:START-ANCHOR</code>, <code>:END-ANCHOR</code>,
  256. <code>:MODELESS-START-ANCHOR</code>,
  257. <code>:MODELESS-END-ANCHOR</code>, and
  258. <code>:MODELESS-END-ANCHOR-NO-NEWLINE</code> are equivalent to Perl's
  259. <code>&quot;^&quot;</code>, <code>&quot;$&quot;</code>,
  260. <code>&quot;\A&quot;</code>, <code>&quot;\Z&quot;</code>, and
  261. <code>&quot;\z&quot;</code> respectively.
  262. <li>The symbols <code>:CASE-INSENSITIVE-P</code>,
  263. <code>:CASE-SENSITIVE-P</code>, <code>:MULTI-LINE-MODE-P</code>,
  264. <code>:NOT-MULTI-LINE-MODE-P</code>, <code>:SINGLE-LINE-MODE-P</code>,
  265. and <code>:NOT-SINGLE-LINE-MODE-P</code> are equivalent to Perl's
  266. <em>embedded modifiers</em> <code>&quot;(?i)&quot;</code>,
  267. <code>&quot;(?-i)&quot;</code>, <code>&quot;(?m)&quot;</code>,
  268. <code>&quot;(?-m)&quot;</code>, <code>&quot;(?s)&quot;</code>, and
  269. <code>&quot;(?-s)&quot;</code>. As usual, changes applied to modes are
  270. kept local to the innermost enclosing grouping or clustering
  271. construct.
  272. </li><li>All other symbols will signal an error of type <a
  273. href="#ppcre-syntax-error"><code>PPCRE-SYNTAX-ERROR</code></a>
  274. <em>unless</em> they are defined to be <a
  275. href="#parse-tree-synonym"><em>parse tree synonyms</em></a>.
  276. <li><code>(:FLAGS {&lt;modifier&gt;}*)</code> where
  277. <code>&lt;modifier&gt;</code> is one of the modifier symbols from
  278. above is used to group modifier symbols. The modifiers are applied
  279. from left to right. (This construct is obviously redundant. It is only
  280. there because it's used by the parser.)
  281. <li><code>(:SEQUENCE {&lt;<i>parse-tree</i>&gt;}*)</code> means a
  282. sequence of parse trees, i.e. the parse trees must match one after
  283. another. Example: <code>(:SEQUENCE #\f #\o #\o)</code> is equivalent
  284. to the parse tree <code>&quot;foo&quot;</code>.
  285. <li><code>(:GROUP {&lt;<i>parse-tree</i>&gt;}*)</code> is like
  286. <code>:SEQUENCE</code> but changes applied to modifier flags (see
  287. above) are kept local to the parse trees enclosed by this
  288. construct. Think of it as the S-expression variant of Perl's
  289. <code>&quot;(?:&lt;<i>pattern</i>&gt;)&quot;</code> construct.
  290. <li><code>(:ALTERNATION {&lt;<i>parse-tree</i>&gt;}*)</code> means an
  291. alternation of parse trees, i.e. one of the parse trees must
  292. match. Example: <code>(:ALTERNATION #\b #\a #\z)</code> is equivalent
  293. to the Perl regex string <code>&quot;b|a|z&quot;</code>.
  294. <li><code>(:BRANCH &lt;<i>test</i>&gt;
  295. &lt;<i>parse-tree</i>&gt;)</code> is for conditional regular
  296. expressions. <code>&lt;<i>test</i>&gt;</code> is either a number which
  297. stands for a register or a parse tree which is a look-ahead or
  298. look-behind assertion. See the entry for
  299. <code>(?(&lt;<i>condition</i>&gt;)&lt;<i>yes-pattern</i>&gt;|&lt;<i>no-pattern</i>&gt;)</code>
  300. in <a
  301. href="http://perldoc.perl.org/perlre.html#Extended-Patterns"><code>man
  302. perlre</code></a> for the semantics of this construct. If
  303. <code>&lt;<i>parse-tree</i>&gt;</code> is an alternation, it
  304. <em>must</em> enclose exactly one or two parse trees where the second
  305. one (if present) will be treated as the &quot;no-pattern&quot; - in
  306. all other cases <code>&lt;<i>parse-tree</i>&gt;</code> will be treated
  307. as the &quot;yes-pattern&quot;.
  308. <li><code>(:POSITIVE-LOOKAHEAD|:NEGATIVE-LOOKAHEAD|:POSITIVE-LOOKBEHIND|:NEGATIVE-LOOKBEHIND
  309. &lt;<i>parse-tree</i>&gt;)</code> should be pretty obvious...
  310. <li><code>(:GREEDY-REPETITION|:NON-GREEDY-REPETITION
  311. &lt;<i>min</i>&gt; &lt;<i>max</i>&gt;
  312. &lt;<i>parse-tree</i>&gt;)</code> where
  313. <code>&lt;<i>min</i>&gt;</code> is a non-negative integer and
  314. <code>&lt;<i>max</i>&gt;</code> is either a non-negative integer not
  315. smaller than <code>&lt;<i>min</i>&gt;</code> or <code>NIL</code> will
  316. result in a regular expression which tries to match
  317. <code>&lt;<i>parse-tree</i>&gt;</code> at least
  318. <code>&lt;<i>min</i>&gt;</code> times and at most
  319. <code>&lt;<i>max</i>&gt;</code> times (or as often as possible if
  320. <code>&lt;<i>max</i>&gt;</code> is <code>NIL</code>). So, e.g.,
  321. <code>(:NON-GREEDY-REPETITION 0 1 &quot;ab&quot;)</code> is equivalent
  322. to the Perl regex string <code>&quot;(?:ab)??&quot;</code>.
  323. <li><code>(:STANDALONE &lt;<i>parse-tree</i>&gt;)</code> is an
  324. &quot;independent&quot; subexpression, i.e. <code>(:STANDALONE
  325. &quot;bar&quot;)</code> is equivalent to the Perl regex string
  326. <code>&quot;(?>bar)&quot;</code>.
  327. <li><code>(:REGISTER &lt;<i>parse-tree</i>&gt;)</code> is a capturing
  328. register group. As usual, registers are counted from left to right
  329. beginning with 1.
  330. <li><code>(:NAMED-REGISTER &lt;<i>name</i>&gt; &lt;<i>parse-tree</i>&gt;)</code> is a named capturing
  331. register group. Acts as <code>:REGISTER</code>, but assigns <code>&lt;<i>name</i>&gt;</code> to a register too. This <code>&lt;<i>name</i>&gt;</code> can be later referred to via <code>:BACK-REFERENCE</code>. Names are case-sensitive and don't need to be unique. See <a href="#*allow-named-registers*"><code>*ALLOW-NAMED-REGISTERS*</code></a> for details.
  332. <li><code>(:BACK-REFERENCE &lt;<i>ref</i>&gt;)</code> is a
  333. back-reference to a register group. <code>&lt;<i>ref</i>&gt;</code> is
  334. a positive integer or a string denoting a register name. If there are
  335. several registers with the same name, the regex engine tries to
  336. successfully match at least one of them, starting with the most recently
  337. seen register continuing to the least recently seen one, until a match
  338. is found. See <a
  339. href="#*allow-named-registers*"><code>*ALLOW-NAMED-REGISTERS*</code></a>
  340. for more information.
  341. <li><code>(:PROPERTY|:INVERTED-PROPERTY &lt;<i>property</i>&gt;)</code> is
  342. a <a href="#*property-resolver*">named property</a> (or its inverse) with
  343. <code>&lt;<i>property</i>&gt;</code> being a function designator or a
  344. string which must be resolved
  345. by <a href="#*property-resolver*"><code>*PROPERTY-RESOLVER*</code></a>.
  346. <li><a class=none name="filterdef"><code>(:FILTER &lt;<i>function</i>&gt; <tt>&amp;optional</tt>
  347. &lt;<i>length</i>&gt;)</code></a> where
  348. <code>&lt;<i>function</i>&gt;</code> is a <a
  349. href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_f.htm#function_designator">function
  350. designator</a> and <code>&lt;<i>length</i>&gt;</code> is a
  351. non-negative integer or <code>NIL</code> is a user-defined <a
  352. href="#filters">filter</a>.
  353. <li><code>(:REGEX &lt;<i>string</i>&gt;)</code> where
  354. <code>&lt;<i>string</i>&gt;</code> is an
  355. embedded <a href="#create-scanner">regular expression in Perl
  356. syntax</a>.
  357. <li><code>(:CHAR-CLASS|:INVERTED-CHAR-CLASS
  358. {&lt;<i>item</i>&gt;}*)</code> where <code>&lt;<i>item</i>&gt;</code>
  359. is either a character, a <em>character range</em>, a named property
  360. (see above), or a symbol for a special character class (see above)
  361. will be translated into a (one character wide) character
  362. class. A <em>character range</em> looks like
  363. <code>(:RANGE &lt;<i>char1</i>&gt; &lt;<i>char2</i>&gt;)</code> where
  364. <code>&lt;<i>char1</i>&gt;</code> and
  365. <code>&lt;<i>char2</i>&gt;</code> are characters such that
  366. <code>(CHAR&lt;= &lt;<i>char1</i>&gt; &lt;<i>char2</i>&gt;)</code> is
  367. true. Example: <code>(:INVERTED-CHAR-CLASS #\a (:RANGE #\D #\G)
  368. :DIGIT-CLASS)</code> is equivalent to the Perl regex string
  369. <code>&quot;[^aD-G\d]&quot;</code>.
  370. </ul>
  371. Because <code>CREATE-SCANNER</code> is defined as a generic function
  372. which dispatches on its first argument there's a certain ambiguity:
  373. Although strings are valid parse trees they will be interpreted as
  374. Perl regex strings when given to <code>CREATE-SCANNER</code>. To
  375. circumvent this you can always use the equivalent parse tree <code>(:GROUP
  376. &lt;<i>string</i>&gt;)</code> instead.
  377. <p>
  378. Note that <code>CREATE-SCANNER</code> doesn't always check
  379. for the well-formedness of its first argument, i.e. you are expected
  380. to provide <em>correct</em> parse trees.
  381. <p>
  382. The usage of the keyword argument <code>extended-mode</code> obviously
  383. doesn't make sense if <code>CREATE-SCANNER</code> is applied to parse
  384. trees and will signal an error.
  385. <p>
  386. If <code>destructive</code> is not <code>NIL</code> (the default is
  387. <code>NIL</code>), the function is allowed to destructively modify
  388. <code><i>parse-tree</i></code> while creating the scanner.
  389. <p>
  390. If you want to find out how parse trees are related to Perl regex
  391. strings, you should play around with
  392. <a href="#parse-string"><code>PARSE-STRING</code></a>:
  393. <pre>
  394. * (parse-string "(ab)*")
  395. (:GREEDY-REPETITION 0 NIL (:REGISTER "ab"))
  396. * (parse-string "(a(b))")
  397. (:REGISTER (:SEQUENCE #\a (:REGISTER #\b)))
  398. * (parse-string "(?:abc){3,5}")
  399. (:GREEDY-REPETITION 3 5 (:GROUP "abc"))
  400. <font color=orange>;; (:GREEDY-REPETITION 3 5 "abc") would also be OK</font>
  401. * (parse-string "a(?i)b(?-i)c")
  402. (:SEQUENCE #\a
  403. (:SEQUENCE (:FLAGS :CASE-INSENSITIVE-P)
  404. (:SEQUENCE #\b (:SEQUENCE (:FLAGS :CASE-SENSITIVE-P) #\c))))
  405. <font color=orange>;; same as (:SEQUENCE #\a :CASE-INSENSITIVE-P #\b :CASE-SENSITIVE-P #\c)</font>
  406. * (parse-string "(?=a)b")
  407. (:SEQUENCE (:POSITIVE-LOOKAHEAD #\a) #\b)
  408. </pre></blockquote>
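<p>
The ambiguity between regex strings and literal strings described above can be seen in a short sketch. It assumes CL-PPCRE is already loaded; the target strings are arbitrary examples.

```lisp
;; Assumes CL-PPCRE is already loaded, e.g. via (ql:quickload :cl-ppcre).
;; A string passed directly is parsed as a Perl regex, so "+" is a quantifier.
(cl-ppcre:scan "a+b" "aaab")            ; matches from position 0 to 4
;; Wrapped in (:GROUP ...), the same string is a literal parse tree.
(cl-ppcre:scan '(:group "a+b") "aaab")  ; no match, returns NIL
(cl-ppcre:scan '(:group "a+b") "xa+b")  ; matches the literal text from 1 to 4
;; The output of PARSE-STRING can be passed back to CREATE-SCANNER.
(let ((scanner (cl-ppcre:create-scanner (cl-ppcre:parse-string "(ab)*"))))
  (cl-ppcre:scan-to-strings scanner "ababx"))
```

Going through <code>PARSE-STRING</code> first is only for demonstration here; passing the regex string directly to <code>CREATE-SCANNER</code> is the usual route.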
  409. <p><br>
  410. <font color=green><b>For the rest of the dictionary, </b><code><i>regex</i></code><b> can
  411. always be a string (which is interpreted as a Perl regular
  412. expression), a parse tree, or a scanner created by
  413. <a href="#create-scanner"><font color=green><code>CREATE-SCANNER</code></font></a>. The
  414. </b><code><i>start</i></code><b> and </b><code><i>end</i></code><b>
  415. keyword parameters are always used as in <a
  416. href="#scan"><font color=green><code>SCAN</code></font></a>.</b></font>
  417. <p><br>[Generic Function]
  418. <br><a class=none name="scan"><b>scan</b> <i>regex target-string <tt>&amp;key</tt> start end</i> =&gt; <i>match-start, match-end, reg-starts, reg-ends</i></a>
  419. <blockquote><br>
  420. Searches the string <code><i>target-string</i></code>
  421. from <code><i>start</i></code> (which defaults to 0) to
  422. <code><i>end</i></code> (which defaults to the length of
  423. <code><i>target-string</i></code>) and tries to match
  424. <code><i>regex</i></code>. On success returns four values - the start
  425. of the match, the end of the match, and two arrays denoting the
  426. beginnings and ends of register matches. On failure returns
  427. <code>NIL</code>. <code><i>target-string</i></code> will be coerced
  428. to a simple string if it isn't one already. (There's another keyword
  429. parameter <code><i>real-start-pos</i></code>. This one should
  430. <em>never</em> be set from user code - it is only used internally.)
  431. <p>
  432. <code>SCAN</code> acts as if the part of
  433. <code><i>target-string</i></code> between <code><i>start</i></code>
  434. and <code><i>end</i></code> were a standalone string, i.e. look-aheads
  435. and look-behinds can't look beyond these boundaries.
  436. <pre>
  437. * (scan "(a)*b" "xaaabd")
  438. 1
  439. 5
  440. #(3)
  441. #(4)
  442. * (scan "(a)*b" "xaaabd" :start 1)
  443. 1
  444. 5
  445. #(3)
  446. #(4)
  447. * (scan "(a)*b" "xaaabd" :start 2)
  448. 2
  449. 5
  450. #(3)
  451. #(4)
  452. * (scan "(a)*b" "xaaabd" :end 4)
  453. NIL
  454. * (scan '(:greedy-repetition 0 nil #\b) "bbbc")
  455. 0
  456. 3
  457. #()
  458. #()
  459. * (scan '(:greedy-repetition 4 6 #\b) "bbbc")
  460. NIL
  461. * (let ((s (create-scanner "(([a-c])+)x")))
  462. (scan s "abcxy"))
  463. 0
  464. 4
  465. #(0 2)
  466. #(3 3)
  467. </pre></blockquote>
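<p>
Because <code>SCAN</code> returns positions rather than strings, a common pattern is to capture its four values and slice the target string yourself. The following is a minimal sketch assuming CL-PPCRE is already loaded; the variable <code>target</code> is an illustrative name.

```lisp
;; Assumes CL-PPCRE is already loaded, e.g. via (ql:quickload :cl-ppcre).
(let ((target "xaaabd"))
  (multiple-value-bind (match-start match-end reg-starts reg-ends)
      (cl-ppcre:scan "(a)*b" target)
    ;; MATCH-START is NIL when there is no match at all.
    (when match-start
      (list (subseq target match-start match-end)
            ;; Entries of the register arrays can be NIL if a register
            ;; did not take part in the match; this one did.
            (subseq target (aref reg-starts 0) (aref reg-ends 0))))))
;; returns ("aaab" "a")
```

This is roughly what <a href="#scan-to-strings"><code>SCAN-TO-STRINGS</code></a> does for you, so the manual version is mainly useful when you need the positions as well.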
  468. <p><br>[Function]
  469. <br><a class=none name="scan-to-strings"><b>scan-to-strings</b> <i>regex target-string <tt>&amp;key</tt> start end sharedp</i> =&gt; <i>match, regs</i></a>
  470. <blockquote><br>
  471. Like <a href="#scan"><code>SCAN</code></a> but returns substrings of
  472. <code><i>target-string</i></code> instead of positions, i.e. this
  473. function returns two values on success: the whole match as a string
  474. plus an array of substrings (or <code>NIL</code>s) corresponding to
  475. the matched registers. If <code><i>sharedp</i></code> is true, the substrings may share structure with
  476. <code><i>target-string</i></code>.
  477. <pre>
  478. * (scan-to-strings "[^b]*b" "aaabd")
  479. "aaab"
  480. #()
  481. * (scan-to-strings "([^b])*b" "aaabd")
  482. "aaab"
  483. #("a")
  484. * (scan-to-strings "(([^b])*)b" "aaabd")
  485. "aaab"
  486. #("aaa" "a")
  487. </pre></blockquote>
  488. <p><br>[Macro]
  489. <br><a class=none name="register-groups-bind"><b>register-groups-bind</b> <i>var-list (regex target-string <tt>&amp;key</tt> start end sharedp) declaration* statement*</i> =&gt; <i>result*</i></a>
  490. <blockquote><br>
  491. Evaluates <code><i>statement*</i></code> with the variables in <code><i>var-list</i></code> bound to the
  492. corresponding register groups after <code><i>target-string</i></code> has been matched
  493. against <code><i>regex</i></code>, i.e. each variable is either
  494. bound to a string or to <code>NIL</code>.
  495. As a shortcut, the elements of <code><i>var-list</i></code> can also be lists of the form <code>(FN&nbsp;VAR)</code> where <code>VAR</code> is the variable symbol
  496. and <code>FN</code> is a <a
  497. href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_f.htm#function_designator">function
  498. designator</a> (which is evaluated) denoting a function which is to be applied to the string before the result is bound to <code>VAR</code>.
  499. To make this even more convenient the form <code>(FN&nbsp;VAR1&nbsp;...VARn)</code> can be used as an abbreviation for
  500. <code>(FN&nbsp;VAR1)&nbsp;...&nbsp;(FN&nbsp;VARn)</code>.
  501. <p>
  502. If there is no match, the <code><i>statement*</i></code> forms are <em>not</em>
  503. executed. For each element of
  504. <code><i>var-list</i></code> which is <code>NIL</code> there's no binding to the corresponding register
  505. group. The number of variables in <code><i>var-list</i></code> must not be greater than
  506. the number of register groups. If <code><i>sharedp</i></code> is true, the substrings may
  507. share structure with <code><i>target-string</i></code>.
  508. <pre>
  509. * (register-groups-bind (first second third fourth)
  510. (&quot;((a)|(b)|(c))+&quot; &quot;abababc&quot; :sharedp t)
  511. (list first second third fourth))
  512. (&quot;c&quot; &quot;a&quot; &quot;b&quot; &quot;c&quot;)
  513. * (register-groups-bind (nil second third fourth)
  514. <font color=orange>;; note that we don't bind the first and fifth register group</font>
  515. (&quot;((a)|(b)|(c))()+&quot; &quot;abababc&quot; :start 6)
  516. (list second third fourth))
  517. (NIL NIL &quot;c&quot;)
  518. * (register-groups-bind (first)
  519. (&quot;(a|b)+&quot; &quot;accc&quot; :start 1)
  520. (format t &quot;This will not be printed: ~A&quot; first))
  521. NIL
  522. * (register-groups-bind (fname lname (#'parse-integer date month year))
  523. (&quot;(\\w+)\\s+(\\w+)\\s+(\\d{1,2})\\.(\\d{1,2})\\.(\\d{4})&quot; &quot;Frank Zappa 21.12.1940&quot;)
  524. (list fname lname (encode-universal-time 0 0 0 date month year 0)))
  525. ("Frank" "Zappa" 1292889600)
  526. </pre>
  527. </blockquote>
  528. <p><br>[Macro]
  529. <br><a class=none name="do-scans"><b>do-scans</b> <i>(match-start match-end reg-starts reg-ends regex target-string <tt>&amp;optional</tt> result-form <tt>&amp;key</tt> start end) declaration* statement*</i> =&gt; <i>result*</i></a>
  530. <blockquote><br>
  531. A macro which iterates over <code><i>target-string</i></code> and
  532. tries to match <code><i>regex</i></code> as often as possible
  533. evaluating <code><i>statement*</i></code> with
  534. <code><i>match-start</i></code>, <code><i>match-end</i></code>,
  535. <code><i>reg-starts</i></code>, and <code><i>reg-ends</i></code> bound
  536. to the four return values of each match (see <a
  537. href="#scan"><code>SCAN</code></a>) in turn. After the last match,
  538. returns <code><i>result-form</i></code> if provided or
  539. <code>NIL</code> otherwise. An implicit block named <code>NIL</code>
  540. surrounds <code>DO-SCANS</code>; <code>RETURN</code> may be used to
  541. terminate the loop immediately. If <code><i>regex</i></code> matches
an empty string, the scan is continued one position after this match.
  543. <p>
  544. This is the most general macro to iterate over all matches in a target
  545. string. See the source code of <a
  546. href="#do-matches"><code>DO-MATCHES</code></a>, <a
  547. href="#all-matches"><code>ALL-MATCHES</code></a>, <a
  548. href="#split"><code>SPLIT</code></a>, or <a
  549. href="#regex-replace-all"><code>REGEX-REPLACE-ALL</code></a> for examples of its
  550. usage.</blockquote>
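<blockquote>
As a minimal sketch of direct <code>DO-SCANS</code> usage (not taken from the CL-PPCRE sources; the regex and the target string are made up for illustration), the following collects the match boundaries and register positions of every match. The register arrays are copied into lists right away so that the collected data doesn't depend on whether the arrays are reused between iterations.
<pre>
* (let (spans)
    (do-scans (match-start match-end reg-starts reg-ends
               &quot;(\\d+)-(\\d+)&quot; &quot;1-2, 30-40&quot;
               (nreverse spans))
      <font color=orange>;; copy the register arrays before moving on to the next match</font>
      (push (list match-start match-end
                  (coerce reg-starts 'list)
                  (coerce reg-ends 'list))
            spans)))
((0 3 (0 2) (1 3)) (5 10 (5 8) (7 10)))
</pre>
</blockquote>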
  551. <p><br>[Macro]
  552. <br><a class=none name="do-matches"><b>do-matches</b> <i>(match-start match-end regex target-string <tt>&amp;optional</tt> result-form <tt>&amp;key</tt> start end) declaration* statement*</i> =&gt; <i>result*</i></a>
  553. <blockquote><br>
  554. Like <a href="#do-scans"><code>DO-SCANS</code></a> but doesn't bind
  555. variables to the register arrays.
  556. <pre>
  557. * (defun foo (regex target-string &amp;key (start 0) (end (length target-string)))
  558. (let ((sum 0))
  559. (do-matches (s e regex target-string nil :start start :end end)
  560. (incf sum (- e s)))
  561. (format t "~,2F% of the string was inside of a match~%"
  562. <font color=orange>;; note: doesn't check for division by zero</font>
  563. (float (* 100 (/ sum (- end start)))))))
  564. FOO
  565. * (foo "a" "abcabcabc")
  566. 33.33% of the string was inside of a match
  567. NIL
  568. * (foo "aa|b" "aacabcbbc")
  569. 55.56% of the string was inside of a match
  570. NIL
  571. </pre></blockquote>
  572. <p><br>[Macro]
  573. <br><a class=none name="do-matches-as-strings"><b>do-matches-as-strings</b> <i>(match-var regex target-string <tt>&amp;optional</tt> result-form <tt>&amp;key</tt> start end sharedp) declaration* statement*</i> =&gt; <i>result*</i></a>
  574. <blockquote><br>
  575. Like <a href="#do-matches"><code>DO-MATCHES</code></a> but binds
  576. <code><i>match-var</i></code> to the substring of
  577. <code><i>target-string</i></code> corresponding to each match in turn. If <code><i>sharedp</i></code> is true, the substrings may share structure with
  578. <code><i>target-string</i></code>.
  579. <pre>
  580. * (defun crossfoot (target-string &amp;key (start 0) (end (length target-string)))
  581. (let ((sum 0))
  582. (do-matches-as-strings (m :digit-class
  583. target-string nil
  584. :start start :end end)
  585. (incf sum (parse-integer m)))
(if (&lt; sum 10)
  587. sum
  588. (crossfoot (format nil "~A" sum)))))
  589. CROSSFOOT
  590. * (crossfoot "bar")
  591. 0
  592. * (crossfoot "a3x")
  593. 3
  594. * (crossfoot "12345")
  595. 6
  596. </pre>
  597. Of course, in real life you would do this with <a href="#do-matches"><code>DO-MATCHES</code></a> and use the <code><i>start</i></code> and <code><i>end</i></code> keyword parameters of <a href="http://www.lispworks.com/documentation/HyperSpec/Body/f_parse_.htm"><code>PARSE-INTEGER</code></a>.</blockquote>
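<blockquote>
The variant hinted at in the last sentence might look like the following sketch. The name <code>CROSSFOOT-2</code> is hypothetical and only chosen to keep it apart from the example above; everything else relies on <code>DO-MATCHES</code> and the standard <code>:START</code> and <code>:END</code> arguments of <code>PARSE-INTEGER</code>, so no substrings are created.
<pre>
* (defun crossfoot-2 (target-string &amp;key (start 0) (end (length target-string)))
    (let ((sum 0))
      (do-matches (s e :digit-class target-string nil :start start :end end)
        <font color=orange>;; parse the digit in place instead of extracting a substring</font>
        (incf sum (parse-integer target-string :start s :end e)))
      (if (&lt; sum 10)
        sum
        (crossfoot-2 (format nil &quot;~A&quot; sum)))))
CROSSFOOT-2
* (crossfoot-2 &quot;a3x&quot;)
3
* (crossfoot-2 &quot;12345&quot;)
6
</pre>
</blockquote>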
  598. <p><br>[Macro]
  599. <br><a class=none name="do-register-groups"><b>do-register-groups</b> <i>var-list (regex target-string <tt>&amp;optional</tt> result-form <tt>&amp;key</tt> start end sharedp) declaration* statement*</i> =&gt; <i>result*</i></a>
  600. <blockquote><br>
  601. Iterates over <code><i>target-string</i></code> and tries to match <code><i>regex</i></code> as often as
  602. possible evaluating <code><i>statement*</i></code> with the variables in <code><i>var-list</i></code> bound to the
  603. corresponding register groups for each match in turn, i.e. each
  604. variable is either bound to a string or to <code>NIL</code>. You can use the same shortcuts and abbreviations as in <a href="#register-groups-bind"><code>REGISTER-GROUPS-BIND</code></a>. The number of
  605. variables in <code><i>var-list</i></code> must not be greater than the number of register
  606. groups. For each element of
  607. <code><i>var-list</i></code> which is <code>NIL</code> there's no binding to the corresponding register
  608. group. After the last match, returns <code><i>result-form</i></code> if provided or <code>NIL</code>
  609. otherwise. An implicit block named <code>NIL</code> surrounds <code>DO-REGISTER-GROUPS</code>;
  610. <code>RETURN</code> may be used to terminate the loop immediately. If <code><i>regex</i></code> matches
an empty string, the scan is continued one position after this
  612. match. If <code><i>sharedp</i></code> is true, the substrings may share structure with
  613. <code><i>target-string</i></code>.
  614. <pre>
  615. * (do-register-groups (first second third fourth)
  616. (&quot;((a)|(b)|(c))&quot; &quot;abababc&quot; nil :start 2 :sharedp t)
  617. (print (list first second third fourth)))
  618. (&quot;a&quot; &quot;a&quot; NIL NIL)
  619. (&quot;b&quot; NIL &quot;b&quot; NIL)
  620. (&quot;a&quot; &quot;a&quot; NIL NIL)
  621. (&quot;b&quot; NIL &quot;b&quot; NIL)
  622. (&quot;c&quot; NIL NIL &quot;c&quot;)
  623. NIL
  624. * (let (result)
  625. (do-register-groups ((#'parse-integer n) (#'intern sign) whitespace)
  626. (&quot;(\\d+)|(\\+|-|\\*|/)|(\\s+)&quot; &quot;12*15 - 42/3&quot;)
  627. (unless whitespace
  628. (push (or n sign) result)))
  629. (nreverse result))
  630. (12 * 15 - 42 / 3)
  631. </pre>
  632. </blockquote>
  633. <p><br>[Function]
  634. <br><a class=none name="all-matches"><b>all-matches</b> <i>regex target-string <tt>&amp;key</tt> start end</i> =&gt; <i>list</i></a>
  635. <blockquote><br>
  636. Returns a list containing the start and end positions of all matches
  637. of <code><i>regex</i></code> against
  638. <code><i>target-string</i></code>, i.e. if there are <code>N</code>
  639. matches the list contains <code>(* 2 N)</code> elements. If
  640. <code><i>regex</i></code> matches an empty string the scan is
continued one position after this match.
  642. <pre>
  643. * (all-matches "a" "foo bar baz")
  644. (5 6 9 10)
  645. * (all-matches "\\w*" "foo bar baz")
  646. (0 3 3 3 4 7 7 7 8 11 11 11)
  647. </pre></blockquote>
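<blockquote>
Because the result is a flat list, the start and end positions have to be consumed pairwise. One possible way to do that (a sketch using only standard <code>LOOP</code> destructuring; the regex is made up for illustration) is:
<pre>
* (let ((target &quot;foo bar baz&quot;))
    <font color=orange>;; step through the list two elements at a time</font>
    (loop for (start end) on (all-matches &quot;ba.&quot; target) by #'cddr
          collect (list start end (subseq target start end))))
((4 7 &quot;bar&quot;) (8 11 &quot;baz&quot;))
</pre>
</blockquote>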
  648. <p><br>[Function]
  649. <br><a class=none name="all-matches-as-strings"><b>all-matches-as-strings</b> <i>regex target-string <tt>&amp;key</tt> start end sharedp</i> =&gt; <i>list</i></a>
  650. <blockquote><br>
  651. Like <a href="#all-matches"><code>ALL-MATCHES</code></a> but
  652. returns a list of substrings instead. If <code><i>sharedp</i></code> is true, the substrings may share structure with
  653. <code><i>target-string</i></code>.
  654. <pre>
  655. * (all-matches-as-strings "a" "foo bar baz")
  656. ("a" "a")
  657. * (all-matches-as-strings "\\w*" "foo bar baz")
  658. ("foo" "" "bar" "" "baz" "")
  659. </pre></blockquote>
  660. <h4><a name="splitting" class=none>Splitting and replacing</a></h4>
  661. <p><br>[Function]
  662. <br><a class=none name="split"><b>split</b> <i>regex target-string <tt>&amp;key</tt> start end limit with-registers-p omit-unmatched-p sharedp</i> =&gt; <i>list</i></a>
  663. <blockquote><br>
  664. Matches <code><i>regex</i></code> against
  665. <code><i>target-string</i></code> as often as possible and returns a
  666. list of the substrings between the matches. If
  667. <code><i>with-registers-p</i></code> is true, substrings corresponding
  668. to matched registers are inserted into the list as well. If
  669. <code><i>omit-unmatched-p</i></code> is true, unmatched registers will
  670. simply be left out, otherwise they will show up as
  671. <code>NIL</code>. <code><i>limit</i></code> limits the number of
  672. elements returned - registers aren't counted. If
  673. <code><i>limit</i></code> is <code>NIL</code> (or 0 which is
  674. equivalent), trailing empty strings are removed from the result list.
  675. If <code><i>regex</i></code> matches an empty string, the scan is
continued one position after this match. If <code><i>sharedp</i></code> is true, the substrings may share structure with
  677. <code><i>target-string</i></code>.
  678. <p>
  679. This function also tries hard to be
  680. Perl-compatible - thus the somewhat peculiar behaviour.
  681. <pre>
  682. * (split "\\s+" "foo bar baz
  683. frob")
  684. ("foo" "bar" "baz" "frob")
  685. * (split "\\s*" "foo bar baz")
  686. ("f" "o" "o" "b" "a" "r" "b" "a" "z")
  687. * (split "(\\s+)" "foo bar baz")
  688. ("foo" "bar" "baz")
  689. * (split "(\\s+)" "foo bar baz" :with-registers-p t)
  690. ("foo" " " "bar" " " "baz")
* (split "(\\s)(\\s*)" "foo bar   baz" :with-registers-p t)
("foo" " " "" "bar" " " "  " "baz")
  693. * (split "(,)|(;)" "foo,bar;baz" :with-registers-p t)
  694. ("foo" "," NIL "bar" NIL ";" "baz")
  695. * (split "(,)|(;)" "foo,bar;baz" :with-registers-p t :omit-unmatched-p t)
  696. ("foo" "," "bar" ";" "baz")
  697. * (split ":" "a:b:c:d:e:f:g::")
  698. ("a" "b" "c" "d" "e" "f" "g")
  699. * (split ":" "a:b:c:d:e:f:g::" :limit 1)
  700. ("a:b:c:d:e:f:g::")
  701. * (split ":" "a:b:c:d:e:f:g::" :limit 2)
  702. ("a" "b:c:d:e:f:g::")
  703. * (split ":" "a:b:c:d:e:f:g::" :limit 3)
  704. ("a" "b" "c:d:e:f:g::")
  705. * (split ":" "a:b:c:d:e:f:g::" :limit 1000)
  706. ("a" "b" "c" "d" "e" "f" "g" "" "")
  707. </pre></blockquote>
  708. <p><br>[Function]
  709. <br><a class=none name="regex-replace"><b>regex-replace</b> <i>regex target-string replacement <tt>&amp;key</tt> start end preserve-case simple-calls element-type</i> =&gt; <i>string, matchp</i></a>
  710. <blockquote><br> Try to match <code><i>target-string</i></code>
  711. between <code><i>start</i></code> and <code><i>end</i></code> against
  712. <code><i>regex</i></code> and replace the first match with
  713. <code><i>replacement</i></code>. Two values are returned; the modified
  714. string, and <code>T</code> if <code><i>regex</i></code> matched or
  715. <code>NIL</code> otherwise.
  716. <p>
  717. <code><i>replacement</i></code> can be a string which may contain the
  718. special substrings <code>&quot;\&amp;&quot;</code> for the whole
  719. match, <code>&quot;\`&quot;</code> for the part of
  720. <code><i>target-string</i></code> before the match,
  721. <code>&quot;\'&quot;</code> for the part of
  722. <code><i>target-string</i></code> after the match,
  723. <code>&quot;\N&quot;</code> or <code>&quot;\{N}&quot;</code> for the
  724. <code>N</code>th register where <code>N</code> is a positive integer.
  725. <p>
  726. <code><i>replacement</i></code> can also be a <a
  727. href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_f.htm#function_designator">function
  728. designator</a> in which case the match will be replaced with the
  729. result of calling the function designated by
  730. <code><i>replacement</i></code> with the arguments
  731. <code><i>target-string</i></code>, <code><i>start</i></code>,
  732. <code><i>end</i></code>, <code><i>match-start</i></code>,
  733. <code><i>match-end</i></code>, <code><i>reg-starts</i></code>, and
  734. <code><i>reg-ends</i></code>. (<code><i>reg-starts</i></code> and
  735. <code><i>reg-ends</i></code> are arrays holding the start and end
  736. positions of matched registers (or <code>NIL</code>) - the meaning of
  737. the other arguments should be obvious.)
  738. <p>
  739. If <code><i>simple-calls</i></code> is true, a function designated by
  740. <code><i>replacement</i></code> will instead be called with the
  741. arguments <code><i>match</i></code>, <code><i>register-1</i></code>,
  742. ..., <code><i>register-n</i></code> where <code><i>match</i></code> is
  743. the whole match as a string and <code><i>register-1</i></code> to
  744. <code><i>register-n</i></code> are the matched registers, also as
  745. strings (or <code>NIL</code>). Note that these strings share structure with
  746. <code><i>target-string</i></code> so you must not modify them.
  747. <p>
  748. Finally, <code><i>replacement</i></code> can be a list where each
  749. element is a string (which will be inserted verbatim), one of the
  750. symbols <code>:match</code>, <code>:before-match</code>, or
  751. <code>:after-match</code> (corresponding to
  752. <code>&quot;\&amp;&quot;</code>, <code>&quot;\`&quot;</code>, and
  753. <code>&quot;\'&quot;</code> above), an integer <code>N</code>
  754. (representing register <code>(1+&nbsp;N)</code>), or a function
  755. designator.
  756. <p>
  757. If <code><i>preserve-case</i></code> is true (default is
  758. <code>NIL</code>), the replacement will try to preserve the case (all
  759. upper case, all lower case, or capitalized) of the match. The result
  760. will always be a <a
  761. href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_f.htm#fresh">fresh</a>
  762. string, even if <code><i>regex</i></code> doesn't match.
  763. <p>
  764. <code><i>element-type</i></code> specifies
  765. the <a
  766. href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_a.htm#array_element_type">array
  767. element type</a> of the string which is returned, the default
  768. is <a
  769. href="http://www.lispworks.com/documentation/lw50/LWRM/html/lwref-346.htm"><code>LW:SIMPLE-CHAR</code></a>
  770. for LispWorks
  771. and <a
  772. href="http://www.lispworks.com/documentation/HyperSpec/Body/t_ch.htm"><code>CHARACTER</code></a>
  773. for other Lisps.
  774. <pre>
  775. * (regex-replace "fo+" "foo bar" "frob")
  776. "frob bar"
  777. T
  778. * (regex-replace "fo+" "FOO bar" "frob")
  779. "FOO bar"
  780. NIL
  781. * (regex-replace "(?i)fo+" "FOO bar" "frob")
  782. "frob bar"
  783. T
  784. * (regex-replace "(?i)fo+" "FOO bar" "frob" :preserve-case t)
  785. "FROB bar"
  786. T
  787. * (regex-replace "(?i)fo+" "Foo bar" "frob" :preserve-case t)
  788. "Frob bar"
  789. T
* (regex-replace "bar" "foo bar baz" "[frob (was '\\&amp;' between '\\`' and '\\'')]")
  791. "foo [frob (was 'bar' between 'foo ' and ' baz')] baz"
  792. T
  793. * (regex-replace "bar" "foo bar baz"
  794. '("[frob (was '" :match "' between '" :before-match "' and '" :after-match "')]"))
  795. "foo [frob (was 'bar' between 'foo ' and ' baz')] baz"
  796. T
  797. * (regex-replace "(be)(nev)(o)(lent)"
  798. "benevolent: adj. generous, kind"
  799. #'(lambda (match &amp;rest registers)
  800. (format nil "~A [~{~A~^.~}]" match registers))
  801. :simple-calls t)
  802. "benevolent [be.nev.o.lent]: adj. generous, kind"
  803. T
  804. </pre></blockquote>
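<blockquote>
The examples above don't use register references in replacement strings or lists, so here is a small sketch (regexes and strings made up for illustration). It uses <code>&quot;\N&quot;</code> and <code>&quot;\{N}&quot;</code> in a replacement string and then the equivalent list form, where the integer <code>0</code> stands for the first register and <code>1</code> for the second. The braces are handy when a digit follows the reference, as in the last call.
<pre>
* (regex-replace &quot;(\\w+)@(\\w+)&quot; &quot;user@host&quot; &quot;\\{2}!\\1&quot;)
&quot;host!user&quot;
T
* (regex-replace &quot;(\\w+)@(\\w+)&quot; &quot;user@host&quot; '(1 &quot;!&quot; 0))
&quot;host!user&quot;
T
* (regex-replace &quot;(a+)&quot; &quot;xaay&quot; &quot;\\{1}0&quot;)
&quot;xaa0y&quot;
T
</pre>
</blockquote>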
  805. <p><br>[Function]
  806. <br><a class=none name="regex-replace-all"><b>regex-replace-all</b> <i>regex target-string replacement <tt>&amp;key</tt> start end preserve-case simple-calls element-type</i> =&gt; <i>string, matchp</i></a>
  807. <blockquote><br>
  808. Like <a href="#regex-replace"><code>REGEX-REPLACE</code></a> but replaces all matches.
  809. <pre>
  810. * (regex-replace-all "(?i)fo+" "foo Fooo FOOOO bar" "frob" :preserve-case t)
  811. "frob Frob FROB bar"
  812. T
  813. * (regex-replace-all "(?i)f(o+)" "foo Fooo FOOOO bar" "fr\\1b" :preserve-case t)
  814. "froob Frooob FROOOOB bar"
  815. T
  816. * (let ((qp-regex (create-scanner "[\\x80-\\xff]")))
  817. (defun encode-quoted-printable (string)
  818. "Converts 8-bit string to quoted-printable representation."
  819. <font color=orange>;; won't work for Corman Lisp because non-ASCII characters aren't 8-bit there</font>
  820. (flet ((convert (target-string start end match-start match-end reg-starts reg-ends)
  821. (declare (ignore start end match-end reg-starts reg-ends))
  822. (format nil "=~2,'0x" (char-code (char target-string match-start)))))
  823. (regex-replace-all qp-regex string #'convert))))
  824. Converted ENCODE-QUOTED-PRINTABLE.
  825. ENCODE-QUOTED-PRINTABLE
  826. * (encode-quoted-printable "F&ecirc;te S&oslash;rensen na&iuml;ve H&uuml;hner Stra&szlig;e")
  827. "F=EAte S=F8rensen na=EFve H=FChner Stra=DFe"
  828. T
  829. * (let ((url-regex (create-scanner "[^a-zA-Z0-9_\\-.]")))
  830. (defun url-encode (string)
  831. "URL-encodes a string."
  832. <font color=orange>;; won't work for Corman Lisp because non-ASCII characters aren't 8-bit there</font>
  833. (flet ((convert (target-string start end match-start match-end reg-starts reg-ends)
  834. (declare (ignore start end match-end reg-starts reg-ends))
  835. (format nil "%~2,'0x" (char-code (char target-string match-start)))))
  836. (regex-replace-all url-regex string #'convert))))
  837. Converted URL-ENCODE.
  838. URL-ENCODE
  839. * (url-encode "F&ecirc;te S&oslash;rensen na&iuml;ve H&uuml;hner Stra&szlig;e")
  840. "F%EAte%20S%F8rensen%20na%EFve%20H%FChner%20Stra%DFe"
  841. T
  842. * (defun how-many (target-string start end match-start match-end reg-starts reg-ends)
(declare (ignore target-string start end match-start match-end))
  844. (format nil "~A" (- (svref reg-ends 0)
  845. (svref reg-starts 0))))
  846. HOW-MANY
  847. * (regex-replace-all "{(.+?)}"
  848. "foo{...}bar{.....}{..}baz{....}frob"
  849. (list "[" 'how-many " dots]"))
  850. "foo[3 dots]bar[5 dots][2 dots]baz[4 dots]frob"
  851. T
  852. * (let ((qp-regex (create-scanner "[\\x80-\\xff]")))
  853. (defun encode-quoted-printable (string)
  854. "Converts 8-bit string to quoted-printable representation.
  855. Version using SIMPLE-CALLS keyword argument."
<font color=orange>;; won't work for Corman Lisp because non-ASCII characters aren't 8-bit there</font>
  857. (flet ((convert (match)
  858. (format nil "=~2,'0x" (char-code (char match 0)))))
  859. (regex-replace-all qp-regex string #'convert
  860. :simple-calls t))))
  861. Converted ENCODE-QUOTED-PRINTABLE.
  862. ENCODE-QUOTED-PRINTABLE
  863. * (encode-quoted-printable "F&ecirc;te S&oslash;rensen na&iuml;ve H&uuml;hner Stra&szlig;e")
  864. "F=EAte S=F8rensen na=EFve H=FChner Stra=DFe"
  865. T
  866. * (defun how-many (match first-register)
  867. (declare (ignore match))
  868. (format nil "~A" (length first-register)))
  869. HOW-MANY
  870. * (regex-replace-all "{(.+?)}"
  871. "foo{...}bar{.....}{..}baz{....}frob"
  872. (list "[" 'how-many " dots]")
  873. :simple-calls t)
  874. "foo[3 dots]bar[5 dots][2 dots]baz[4 dots]frob"
  875. T
  876. </pre></blockquote>
  877. <h4><a name="modify" class=none>Modifying scanner behaviour</a></h4>
  878. <p><br>[Special variable]
  879. <br><a class=none name="*property-resolver*"><b>*property-resolver*</b></a>
  880. </p><blockquote><br> This is the designator for a function responsible
  881. for resolving named properties like <code>\p{Number}</code>. If
  882. CL-PPCRE encounters a <code>\p</code> or a <code>\P</code> it expects
  883. to see an opening curly brace immediately afterwards and will then
  884. read everything following that brace until it sees a closing curly
  885. brace. The resolver function will be called with this string and must
  886. return a corresponding unary test function which accepts a character
  887. as its argument and returns a true value if and only if the character
  888. has the named property. If the resolver returns <code>NIL</code>
  889. instead, it signals that a property of that name is unknown.
  890. <pre>
  891. * (labels ((char-code-odd-p (char)
  892. (oddp (char-code char)))
  893. (char-code-even-p (char)
  894. (evenp (char-code char)))
  895. (resolver (name)
  896. (cond ((string= name "odd") #'char-code-odd-p)
  897. ((string= name "even") #'char-code-even-p)
  898. ((string= name "true") (constantly t))
  899. (t (error "Can't resolve ~S." name)))))
  900. (let ((*property-resolver* #'resolver))
  901. <font color=orange>;; quiz question - why do we need CREATE-SCANNER here?</font>
  902. (list (regex-replace-all (create-scanner "\\p{odd}") "abcd" "+")
  903. (regex-replace-all (create-scanner "\\p{even}") "abcd" "+")
  904. (regex-replace-all (create-scanner "\\p{true}") "abcd" "+"))))
  905. ("+b+d" "a+c+" "++++")
  906. </pre>
  907. If the value
  908. of <a href="#*property-resolver*"><code>*PROPERTY-RESOLVER*</code></a>
  909. is <code>NIL</code> (which is the default), <code>\p</code> and <code>\P</code> in regex
  910. strings will simply be treated like <code>p</code> or <code>P</code>
  911. as in CL-PPCRE&nbsp;1.4.1 and earlier. Note that this does not affect
  912. the validity of <code>(:PROPERTY&nbsp;&lt;<i>name</i>&gt;)</code>
  913. parts in <a href="#create-scanner2">S-expression syntax</a>.
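<p>
As a further sketch (the property names <code>Upper</code> and <code>Digit</code> and the function name <code>MY-RESOLVER</code> are made up for this example), a resolver can simply map names to standard character predicates and return <code>NIL</code> for everything else:
<pre>
* (defun my-resolver (name)
    (cond ((string= name &quot;Upper&quot;) #'upper-case-p)
          ((string= name &quot;Digit&quot;) #'digit-char-p)
          <font color=orange>;; unknown property name</font>
          (t nil)))
MY-RESOLVER
* (let ((*property-resolver* #'my-resolver))
    (all-matches-as-strings (create-scanner &quot;\\p{Upper}\\p{Digit}+&quot;)
                            &quot;a1 B22 c3 D4&quot;))
(&quot;B22&quot; &quot;D4&quot;)
</pre>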
  914. </blockquote>
  915. <p><br>[Accessor]
  916. <br><a class="none" name="parse-tree-synonym"><b>parse-tree-synonym</b> <i>symbol</i> =&gt; <i>parse-tree</i>
  917. <br><tt>(setf (</tt><b>parse-tree-synonym</b> <i>symbol</i><tt>)</tt> <i>new-parse-tree</i><tt>)</tt></a>
  918. </p><blockquote><br>
  919. Any symbol (unless it's a keyword with a special meaning in parse
  920. trees) can be made a "synonym", i.e. an abbreviation, for another parse
  921. tree by this accessor. <code>PARSE-TREE-SYNONYM</code> returns <code>NIL</code> if <code><i>symbol</i></code> isn't a synonym yet.
  922. <pre>
  923. * (parse-string "a*b+")
  924. (:SEQUENCE (:GREEDY-REPETITION 0 NIL #\a) (:GREEDY-REPETITION 1 NIL #\b))
  925. * (defun my-repetition (char min)
  926. `(:greedy-repetition ,min nil ,char))
  927. MY-REPETITION
  928. * (setf (parse-tree-synonym 'a*) (my-repetition #\a 0))
  929. (:GREEDY-REPETITION 0 NIL #\a)
  930. * (setf (parse-tree-synonym 'b+) (my-repetition #\b 1))
  931. (:GREEDY-REPETITION 1 NIL #\b)
  932. * (let ((scanner (create-scanner '(:sequence a* b+))))
  933. (dolist (string '("ab" "b" "aab" "a" "x"))
  934. (print (scan scanner string)))
  935. (values))
  936. 0
  937. 0
  938. 0
  939. NIL
  940. NIL
  941. * (parse-tree-synonym 'a*)
  942. (:GREEDY-REPETITION 0 NIL #\a)
  943. * (parse-tree-synonym 'a+)
  944. NIL
  945. </pre></blockquote>
  946. <p><br>[Macro]
  947. <br><a class="none" name="define-parse-tree-synonym"><b>define-parse-tree-synonym</b> <i>name parse-tree</i> =&gt; <i>parse-tree</i></a>
  948. </p><blockquote><br>
  949. This is a convenience macro for parse tree synonyms defined as
  950. <pre>
  951. (defmacro define-parse-tree-synonym (name parse-tree)
  952. `(eval-when (:compile-toplevel :load-toplevel :execute)
  953. (setf (parse-tree-synonym ',name) ',parse-tree)))
  954. </pre>
  955. so you can write code like this:
  956. <pre>
  957. (define-parse-tree-synonym a-z
  958. (:char-class (:range #\a #\z) (:range #\A #\Z)))
  959. (define-parse-tree-synonym a-z*
  960. (:greedy-repetition 0 nil a-z))
  961. (defun ascii-char-tester (string)
  962. (scan '(:sequence :start-anchor a-z* :end-anchor)
  963. string))
  964. </pre></blockquote>
  965. <p><br>[Special variable]
  966. <br><a class=none name="*regex-char-code-limit*"><b>*regex-char-code-limit*</b></a>
  967. <blockquote><br>This variable controls whether scanners take into
account all characters of your CL implementation or only those
whose <a
href="http://www.lispworks.com/documentation/HyperSpec/Body/f_char_c.htm#char-code"><code>CHAR-CODE</code></a>
is smaller than its value. The default is
  972. <a href="http://www.lispworks.com/documentation/HyperSpec/Body/v_char_c.htm"><code>CHAR-CODE-LIMIT</code></a>,
  973. and you might see significant speed and space improvements during
  974. scanner <em>creation</em> if, say, your target strings only
  975. contain <a href="http://czyborra.com/charsets/iso8859.html">ISO-8859-1</a>
  976. characters and you're using a Lisp implementation
  977. where <code>CHAR-CODE-LIMIT</code> has a value much higher
  978. than&nbsp;256. The <a href="#test">test suite</a> will automatically
  979. set <code>*REGEX-CHAR-CODE-LIMIT*</code> to 256 while you're running
  980. the default test.
  981. <p>
  982. Note: Due to the nature of <a href="http://www.lispworks.com/documentation/HyperSpec/Body/s_ld_tim.htm"><code>LOAD-TIME-VALUE</code></a> and the <a
  983. href="#compiler-macro">compiler macro for <code>SCAN</code> and other functions</a>, some
  984. scanners might be created in a <a
  985. href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_n.htm#null_lexical_environment">null
  986. lexical environment</a> at load time or at compile time so be careful
  987. to which value <code>*REGEX-CHAR-CODE-LIMIT*</code> is bound at that
  988. time. The default value should always yield correct results unless you
  989. play dirty tricks with implementation-dependent behaviour, though.</blockquote>
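<blockquote>
One way to sidestep the caveat in the note is to bind the variable around an explicit call to <code>CREATE-SCANNER</code>, so that it is clear which value is in effect while the scanner is built. A minimal sketch, assuming the target strings only contain characters with char codes below 256 (the variable name <code>*LATIN-1-SCANNER*</code> is made up for this example):
<pre>
* (defparameter *latin-1-scanner*
    (let ((*regex-char-code-limit* 256))
      <font color=orange>;; the scanner is built while the limit is bound to 256</font>
      (create-scanner &quot;[^a]+&quot;)))
*LATIN-1-SCANNER*
* (scan *latin-1-scanner* &quot;bcd&quot;)
0
3
#()
#()
</pre>
</blockquote>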
  990. <p><br>[Special variable]
  991. <br><a class=none name="*use-bmh-matchers*"><b>*use-bmh-matchers*</b></a>
  992. <blockquote><br>Usually, the scanners created
  993. by <a href="#create-scanner"><code>CREATE-SCANNER</code></a> (or
  994. implicitly by other functions and macros) will use the standard
  995. function <a href="http://www.lispworks.com/documentation/HyperSpec/Body/f_search.htm"><code>SEARCH</code></a>
  996. to check for constant strings at the start or end of the regular
  997. expression. If <code>*USE-BMH-MATCHERS*</code> is true (the default
  998. is <code>NIL</code>),
  999. fast <a href="http://www-igm.univ-mlv.fr/~lecroq/string/node18.html">Boyer-Moore-Horspool
  1000. matchers</a> will be used instead. This will usually be faster but
can make the scanners considerably bigger: for each BMH matcher (there
can be up to two per scanner) a fixnum array of
size <a href="#*regex-char-code-limit*"><code>*REGEX-CHAR-CODE-LIMIT*</code></a>
is allocated and closed over.
  1005. <p>
  1006. Note: Due to the nature of <a href="http://www.lispworks.com/documentation/HyperSpec/Body/s_ld_tim.htm"><code>LOAD-TIME-VALUE</code></a> and the <a
  1007. href="#compiler-macro">compiler macro for <code>SCAN</code> and other functions</a>, some
  1008. scanners might be created in a <a
  1009. href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_n.htm#null_lexical_environment">null
  1010. lexical environment</a> at load time or at compile time so be careful
  1011. to which value <code>*USE-BMH-MATCHERS*</code> is bound at that
  1012. time.</blockquote>
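<blockquote>
To make sure the switch is in effect when the scanner is built, it can be bound around an explicit call to <code>CREATE-SCANNER</code>. In this sketch (regex and target string made up for illustration) the scan results are the same as they would be without BMH matchers; only the search strategy for the constant string differs.
<pre>
* (let ((*use-bmh-matchers* t))
    (scan (create-scanner &quot;needle&quot;) &quot;haystack with a needle&quot;))
16
22
#()
#()
</pre>
</blockquote>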
  1013. <p><br>[Special variable]<br><a class=none name='*optimize-char-classes*'><b>*optimize-char-classes*</b></a>
  1014. <blockquote><br>
  1015. Whether character classes should be compiled into look-ups into <em>O(1)</em>
data structures. This usually makes matching fast but will be costly in terms of
  1017. scanner creation time and might be costly in terms of size if
  1018. <a href="#*regex-char-code-limit*"><code>*REGEX-CHAR-CODE-LIMIT*</code></a>
  1019. is high. This value will be used as the <code><i>kind</i></code>
  1020. keyword argument
  1021. to <a href="#create-optimized-test-function"><code>CREATE-OPTIMIZED-TEST-FUNCTION</code></a>
  1022. - see there for the possible non-<code>NIL</code> values. The default
  1023. value (<code>NIL</code>) should usually be fine unless you're sure
  1024. that you absolutely have to optimize some character classes for speed.
  1025. <p>
  1026. Note: Due to the nature
  1027. of <a href="http://www.lispworks.com/documentation/HyperSpec/Body/s_ld_tim.htm"><code>LOAD-TIME-VALUE</code></a>
  1028. and the <a href="#compiler-macro">compiler macro for <code>SCAN</code>
  1029. and other functions</a>, some scanners might be created in
  1030. a <a href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_n.htm#null_lexical_environment">null
  1031. lexical environment</a> at load time or at compile time so be careful
  1032. to which value <code>*OPTIMIZE-CHAR-CLASSES*</code> is bound at that
  1033. time.
  1034. </blockquote>
  1035. <p><br>[Special variable]
  1036. <br><a class=none name="*allow-quoting*"><b>*allow-quoting*</b></a>
  1037. <blockquote><br>
  1038. If this value is <em>true</em> (the default is <code>NIL</code>),
  1039. CL-PPCRE will support <code>\Q</code> and <code>\E</code> in regex
  1040. strings to quote (disable) metacharacters. Note that this entails a
  1041. slight performance penalty when creating scanners because (a copy of) the regex
  1042. string is modified (probably more than once) before it
  1043. is fed to the parser. Also, the parser's <a
  1044. href="#ppcre-syntax-error">syntax error messages</a> will complain
  1045. about the converted string and not about the original regex string.
  1046. <pre>
  1047. * (scan &quot;^a+$&quot; &quot;a+&quot;)
  1048. NIL
  1049. * (let ((*allow-quoting* t))
<font color=orange>;; we use CREATE-SCANNER because of Lisps like SBCL that don't have an interpreter</font>
  1051. (scan (create-scanner &quot;^\\Qa+\\E$&quot;) &quot;a+&quot;))
  1052. 0
  1053. 2
  1054. #()
  1055. #()
  1056. * (let ((*allow-quoting* t))
  1057. (scan (create-scanner &quot;\\Qa()\\E(?#comment\\Q)a**b&quot;) &quot;()ab&quot;))
  1058. Quantifier '*' not allowed at position 19 in string &quot;a\\(\\)(?#commentQ)a**b&quot;
  1059. </pre>
  1060. Note how in the last example the regex string in the error message is
  1061. different from the first argument to the <code>SCAN</code>
  1062. function. Also note that the second example might be easier to
  1063. understand (and Lisp-ier) if you write it like this:
  1064. <pre>
  1065. * (scan '(:sequence :start-anchor
  1066. &quot;a+&quot; <font color=orange>;; no quoting necessary</font>
  1067. :end-anchor)
  1068. &quot;a+&quot;)
  1069. 0
  1070. 2
  1071. #()
  1072. #()
  1073. </pre>
  1074. Make sure you also read <a href="#quote">the relevant section</a> in &quot;<a href="#bugs">Bugs and problems</a>.&quot;
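<p>
If only a known substring has to be quoted, an alternative that doesn't depend on <code>*ALLOW-QUOTING*</code> is to escape it with the function <code>QUOTE-META-CHARS</code> while building the regex string. A small sketch:
<pre>
* (quote-meta-chars &quot;a+&quot;)
&quot;a\\+&quot;
* (scan (concatenate 'string &quot;^&quot; (quote-meta-chars &quot;a+&quot;) &quot;$&quot;) &quot;a+&quot;)
0
2
#()
#()
</pre>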
  1075. <p>
  1076. Note: Due to the nature of <a href="http://www.lispworks.com/documentation/HyperSpec/Body/s_ld_tim.htm"><code>LOAD-TIME-VALUE</code></a> and the <a
  1077. href="#compiler-macro">compiler macro for <code>SCAN</code> and other functions</a>, some
  1078. scanners might be created in a <a
  1079. href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_n.htm#null_lexical_environment">null
  1080. lexical environment</a> at load time or at compile time so be careful
  1081. to which value <code>*ALLOW-QUOTING*</code> is bound at that
  1082. time.</blockquote>
  1083. </blockquote>
  1084. <p><br>[Special variable]
  1085. <br><a class=none name="*allow-named-registers*"><b>*allow-named-registers*</b></a>
  1086. <blockquote><br>
  1087. If this value is <em>true</em> (the default is <code>NIL</code>),
  1088. CL-PPCRE will support <code>(?<i>&lt;name&gt;"&lt;regex&gt;"</i>)</code> and <code>\k<i>&lt;name&gt;</i></code> in regex
strings to provide named registers and back-references as in <a href="http://www.franz.com/support/documentation/7.0/doc/regexp.htm#regexp-new-capturing-2">AllegroCL</a>. <code><i>name</i></code> has to start with a letter and can contain only alphanumeric characters or the minus sign. Names of registers are matched case-sensitively.
  1090. The <a href="#create-scanner2">parse tree syntax</a> is not affected by the <code>*ALLOW-NAMED-REGISTERS*</code> switch, <code>:NAMED-REGISTER</code> and <code>:BACK-REFERENCE</code> forms are always resolved as expected. There are also no restrictions on register names in this syntax except that they have to be strings.
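<p>
Since the register names are only available as the second return value of <code>CREATE-SCANNER</code>, looking up a register by name has to be done by hand. A minimal sketch (the helper <code>NAMED-REGISTER-STRING</code> is hypothetical and not part of CL-PPCRE; if a name is used more than once it only considers the first register with that name):
<pre>
* (defun named-register-string (name regex target-string)
    (let ((*allow-named-registers* t))
      (multiple-value-bind (scanner names)
          (create-scanner regex)
        (multiple-value-bind (match-start match-end reg-starts reg-ends)
            (scan scanner target-string)
          (declare (ignore match-end))
          <font color=orange>;; unnamed registers show up as NIL in NAMES</font>
          (let ((index (position name names :test #'equal)))
            (when (and match-start index (aref reg-starts index))
              (subseq target-string
                      (aref reg-starts index)
                      (aref reg-ends index))))))))
NAMED-REGISTER-STRING
* (named-register-string &quot;month&quot; &quot;(?&lt;year&gt;\\d{4})-(?&lt;month&gt;\\d{2})&quot; &quot;on 1940-12-21&quot;)
&quot;12&quot;
</pre>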
  1091. <pre>
  1092. <font color=orange>;; Perl compatible mode (*ALLOW-NAMED-REGISTERS* is NIL)</font>
  1093. * (create-scanner "(?&lt;reg&gt;.*)")
Character 'r' may not follow '(?&lt;' at position 3 in string "(?&lt;reg&gt;.*)"
  1095. <font color=orange>;; just unescapes "\\k"</font>
  1096. * (parse-string "\\k&lt;reg&gt;")
  1097. "k&lt;reg&gt;"
  1098. * (setq *allow-named-registers* t)
  1099. T
  1100. * (create-scanner "((?&lt;small&gt;[a-z]*)(?&lt;big&gt;[A-Z]*))")
  1101. #&lt;CLOSURE (LAMBDA (STRING CL-PPCRE::START CL-PPCRE::END)) {AD75BFD}&gt;
  1102. (NIL "small" "big")
  1103. <font color=orange>;; the scanner doesn't capture any information about named groups -
  1104. ;; you have to store the second value returned from CREATE-SCANNER yourself</font>
  1105. * (scan * "aaaBBB")
  1106. 0
  1107. 6
  1108. #(0 0 3)
  1109. #(6 3 6)
  1110. <font color=orange>;; parse tree syntax</font>
  1111. * (parse-string "((?&lt;small&gt;[a-z]*)(?&lt;big&gt;[A-Z]*))")
  1112. (:REGISTER
  1113. (:SEQUENCE
  1114. (:NAMED-REGISTER "small"
  1115. (:GREEDY-REPETITION 0 NIL (:CHAR-CLASS (:RANGE #\a #\z))))
  1116. (:NAMED-REGISTER "big"
  1117. (:GREEDY-REPETITION 0 NIL (:CHAR-CLASS (:RANGE #\A #\Z))))))
  1118. * (create-scanner *)
  1119. #&lt;CLOSURE (LAMBDA (STRING CL-PPCRE::START CL-PPCRE::END)) {B158E3D}&gt;
  1120. (NIL "small" "big")
  1121. <font color=orange>;; multiple-choice back-reference</font>
  1122. * (scan "^(?&lt;reg&gt;[ab])(?&lt;reg&gt;[12])\\k&lt;reg&gt;\\k&lt;reg&gt;$" "a1aa")
  1123. 0
  1124. 4
  1125. #(0 1)
  1126. #(1 2)
  1127. * (scan "^(?&lt;reg&gt;[ab])(?&lt;reg&gt;[12])\\k&lt;reg&gt;\\k&lt;reg&gt;$" "a22a")
  1128. 0
  1129. 4
  1130. #(0 1)
  1131. #(1 2)
  1132. <font color=orange>;; demonstrating most-recently-seen-register-first property of back-reference;
  1133. ;; "greedy" regex (analogous to "aa?")</font>
  1134. * (scan "^(?&lt;reg&gt;)(?&lt;reg&gt;a)(\\k&lt;reg&gt;)" "a")
  1135. 0
  1136. 1
  1137. #(0 0 1)
  1138. #(0 1 1)
  1139. * (scan "^(?&lt;reg&gt;)(?&lt;reg&gt;a)(\\k&lt;reg&gt;)" "aa")
  1140. 0
  1141. 2
  1142. #(0 0 1)
  1143. #(0 1 2)
  1144. <font color=orange>;; switched groups
  1145. ;; "lazy" regex (analogous to "aa??")</font>
  1146. * (scan "^(?&lt;reg&gt;a)(?&lt;reg&gt;)(\\k&lt;reg&gt;)" "a")
  1147. 0
  1148. 1
  1149. #(0 1 1)
  1150. #(1 1 1)
  1151. <font color=orange>;; scanner ignores the second "a"</font>
  1152. * (scan "^(?&lt;reg&gt;a)(?&lt;reg&gt;)(\\k&lt;reg&gt;)" "aa")
  1153. 0
  1154. 1
  1155. #(0 1 1)
  1156. #(1 1 1)
  1157. <font color=orange>;; "aa" will be matched only when forced by adding "$" at the end</font>
  1158. * (scan "^(?&lt;reg&gt;a)(?&lt;reg&gt;)(\\k&lt;reg&gt;)$" "aa")
  1159. 0
  1160. 2
  1161. #(0 1 1)
  1162. #(1 1 2)
  1163. </pre>
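<p>
As the comment in the transcript above points out, the scanner itself doesn't remember register names. The following is only a minimal sketch, not part of CL-PPCRE, of how the second value returned by <code>CREATE-SCANNER</code> could be used to look up a register by name. The helper name <code>NAMED-REGISTER-VALUE</code> is hypothetical, and if a name occurs more than once, only the first register with that name is consulted. With the regex and target from above, the call at the end should return <code>"BBB"</code>.
<pre>
<font color=orange>;; Hypothetical helper, not part of CL-PPCRE.</font>
(defun named-register-value (regex name target)
  "Match the regex string REGEX against TARGET and return the substring
captured by the first register called NAME, or NIL."
  (let ((*allow-named-registers* t))
    <font color=orange>;; The second value is a list of register names with NIL for
    ;; unnamed registers, e.g. (NIL "small" "big").</font>
    (multiple-value-bind (scanner names)
        (create-scanner regex)
      (multiple-value-bind (match-start match-end reg-starts reg-ends)
          (scan scanner target)
        (declare (ignore match-end))
        (let ((index (position name names :test #'equal)))
          (when (and match-start index (aref reg-starts index))
            (subseq target
                    (aref reg-starts index)
                    (aref reg-ends index))))))))

(named-register-value "((?&lt;small&gt;[a-z]*)(?&lt;big&gt;[A-Z]*))" "big" "aaaBBB")
</pre>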
  1164. Note: Due to the nature of <a href="http://www.lispworks.com/documentation/HyperSpec/Body/s_ld_tim.htm"><code>LOAD-TIME-VALUE</code></a> and the <a
  1165. href="#compiler-macro">compiler macro for <code>SCAN</code> and other functions</a>, some
  1166. scanners might be created in a <a
  1167. href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_n.htm#null_lexical_environment">null
  1168. lexical environment</a> at load time or at compile time, so pay attention
  1169. to the value <code>*ALLOW-NAMED-REGISTERS*</code> is bound to at that
  1170. time.</blockquote>
  1171. </blockquote>
  1172. <h4><a name="misc" class=none>Miscellaneous</a></h4>
  1173. <p><br>[Function]
  1174. <br><a class=none name="parse-string"><b>parse-string</b> <i>string</i> =&gt; <i>parse-tree</i></a>
  1175. <blockquote><br> Converts the <a href="#create-scanner">regex
  1176. string</a> <code><i>string</i></code> into a <a href="#create-scanner2">parse tree</a>.
  1177. Note that the result is usually one possible way of creating an
  1178. equivalent parse tree and not necessarily the "canonical" one.
  1179. Specifically, the parse tree might contain redundant parts which are
  1180. supposed to be excised when a scanner is created.
  1181. </blockquote>
  1182. <p><br>[Function]<br><a class=none name='create-optimized-test-function'><b>create-optimized-test-function</b> <i>test-function <tt>&amp;key</tt> start end kind</i> =&gt; <i>function</i></a>
  1183. <blockquote><br>
  1184. Given a unary test function <code><i>test-function</i></code> which is
  1185. applicable to characters, returns a function which yields the same
  1186. boolean results for all characters with character codes
  1187. from <code><i>start</i></code> to (excluding) <code><i>end</i></code>.
  1188. If <code><i>kind</i></code>
  1189. is <code>NIL</code>, <code><i>test-function</i></code> will simply be
  1190. returned. Otherwise, <code><i>kind</i></code> should be one of:
  1191. <dl>
  1192. <dt><code>:HASH-TABLE</code></dt>
  1193. <dd>The function builds a hash table representing all characters which
  1194. satisfy the test and returns a closure which checks if a character is
  1195. in that hash table.</dd>
  1196. <dt><code>:CHARSET</code></dt>
  1197. <dd>Instead of a hash table the function uses a &quot;charset&quot;
  1198. which is a data structure using non-linear hashing and optimized to
  1199. represent (sparse) sets of characters in a fast and space-efficient
  1200. way (contributed by Nikodemus Siivola).</dd>
  1201. <dt><code>:CHARMAP</code></dt>
  1202. <dd>Instead of a hash table the function uses a bit vector to
  1203. represent the set of characters.</dd>
  1204. </dl>
  1205. You can also use <code>:HASH-TABLE*</code> or <code>:CHARSET*</code>
  1206. which are like <code>:HASH-TABLE</code> and <code>:CHARSET</code> but
  1207. use the complement of the set if the set contains more than half of
  1208. all characters between <code><i>start</i></code>
  1209. and <code><i>end</i></code>. This saves space but needs an additional
  1210. pass across all characters to create the data structure. There is no
  1211. corresponding <code>:CHARMAP*</code> <code><i>kind</i></code> as the bit vectors are
  1212. already created to cover the smallest possible interval which contains
  1213. either the set or its complement.
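<p>
A short usage sketch, assuming you are in a package which uses <code>CL-PPCRE</code>: the variable name <code>*FAST-VOWEL-P*</code> and the vowel test are made up for this illustration. The returned function should only be relied upon for characters with codes from <code><i>start</i></code> to (excluding) <code><i>end</i></code>.
<pre>
<font color=orange>;; *FAST-VOWEL-P* is a hypothetical name used only for this example.</font>
(defparameter *fast-vowel-p*
  (create-optimized-test-function
   (lambda (char) (find char "aeiouAEIOU"))
   :start 0 :end 128       <font color=orange>; only ASCII character codes are considered</font>
   :kind :charmap))        <font color=orange>; represent the set as a bit vector</font>

(funcall *fast-vowel-p* #\e)   <font color=orange>; true</font>
(funcall *fast-vowel-p* #\x)   <font color=orange>; false</font>
</pre>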
  1214. <p>
  1215. See also <a href="#*optimize-char-classes*"><code>*OPTIMIZE-CHAR-CLASSES*</code></a>.
  1216. </blockquote>
  1217. <p><br>[Function]
  1218. <br><a class=none name="quote-meta-chars"><b>quote-meta-chars</b> <i>string</i> =&gt; <i>string'</i></a>
  1219. <blockquote><br>
  1220. This is a simple utility function used when <a
  1221. href="#*allow-quoting*"><code>*ALLOW-QUOTING*</code></a> is
  1222. <em>true</em>. It returns a string <code>STRING'</code> where all
  1223. non-word characters (everything except ASCII letters, digits and the
  1224. underscore) of <code>STRING</code> are quoted by prepending a
  1225. backslash similar to Perl's <code>quotemeta</code> function. It always returns a <a
  1226. href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_f.htm#fresh">fresh</a>
  1227. string.
  1228. <pre>
  1229. * (quote-meta-chars &quot;[a-z]*&quot;)
  1230. &quot;\\[a\\-z\\]\\*&quot;
  1231. </pre></blockquote>
  1232. <p><br>[Function]
  1233. <br><a class=none name="regex-apropos"><b>regex-apropos</b> <i>regex <tt>&amp;optional</tt> packages <tt>&amp;key</tt> case-insensitive</i> =&gt; <i>list</i></a>
  1234. <blockquote><br>
  1235. Like <a
  1236. href="http://www.lispworks.com/documentation/HyperSpec/Body/f_apropo.htm"><code>APROPOS</code></a>
  1237. but searches for interned symbols which match the regular expression
  1238. <code><i>regex</i></code>. The output is implementation-dependent. If
  1239. <code><i>case-insensitive</i></code> is true (which is the default)
  1240. and <code><i>regex</i></code> isn't already a scanner, a
  1241. case-insensitive scanner is used.
  1242. <p>
  1243. Here are examples for CMUCL:
  1244. <pre>
  1245. * *package*
  1246. #&lt;The COMMON-LISP-USER package, 16/21 internal, 0/9 external&gt;
  1247. * (defun foo (n &amp;optional (k 0)) (+ 3 n k))
  1248. FOO
  1249. * (defparameter foo "bar")
  1250. FOO
  1251. * (defparameter |foobar| 42)
  1252. |foobar|
  1253. * (defparameter fooboo 43)
  1254. FOOBOO
  1255. * (defclass frobar () ())
  1256. #&lt;STANDARD-CLASS FROBAR {4874E625}&gt;
  1257. * (regex-apropos "foo(?:bar)?")
  1258. FOO [variable] value: "bar"
  1259. [compiled function] (N &amp;OPTIONAL (K 0))
  1260. FOOBOO [variable] value: 43
  1261. |foobar| [variable] value: 42
  1262. * (regex-apropos "(?:foo|fro)bar")
  1263. PCL::|COMMON-LISP-USER::FROBAR class predicate| [compiled closure]
  1264. FROBAR [class] #&lt;STANDARD-CLASS FROBAR {4874E625}&gt;
  1265. |foobar| [variable] value: 42
  1266. * (regex-apropos "(?:foo|fro)bar" 'cl-user)
  1267. FROBAR [class] #&lt;STANDARD-CLASS FROBAR {4874E625}&gt;
  1268. |foobar| [variable] value: 42
  1269. * (regex-apropos "(?:foo|fro)bar" '(pcl ext))
  1270. PCL::|COMMON-LISP-USER::FROBAR class predicate| [compiled closure]
  1271. * (regex-apropos "foo")
  1272. FOO [variable] value: "bar"
  1273. [compiled function] (N &amp;OPTIONAL (K 0))
  1274. FOOBOO [variable] value: 43
  1275. |foobar| [variable] value: 42
  1276. * (regex-apropos "foo" nil :case-insensitive nil)
  1277. |foobar| [variable] value: 42
  1278. </pre></blockquote>
  1279. <p><br>[Function]
  1280. <br><a class=none name="regex-apropos-list"><b>regex-apropos-list</b> <i>regex <tt>&amp;optional</tt> packages <tt>&amp;key</tt> case-insensitive</i> =&gt; <i>list</i></a>
  1281. <blockquote><br>
  1282. Like <a
  1283. href="http://www.lispworks.com/documentation/HyperSpec/Body/f_apropo.htm"><code>APROPOS-LIST</code></a>
  1284. but searches for interned symbols which match the regular expression
  1285. <code><i>regex</i></code>. If <code><i>case-insensitive</i></code> is
  1286. true (which is the default) and <code><i>regex</i></code> isn't
  1287. already a scanner, a case-insensitive scanner is used.
  1288. <p>
  1289. Example (continued from above):
  1290. <pre>
  1291. * (regex-apropos-list &quot;foo(?:bar)?&quot;)
  1292. (|foobar| FOOBOO FOO)
  1293. </pre></blockquote>
  1294. <h4><a name="conditions" class=none>Conditions</a></h4>
  1295. <p><br>[Condition type]
  1296. <br><a class=none name="ppcre-error"><b>ppcre-error</b></a>
  1297. <blockquote><br>
  1298. Every error signaled by CL-PPCRE is of type
  1299. <code>PPCRE-ERROR</code>. This is a direct subtype of <a
  1300. href="http://www.lispworks.com/documentation/HyperSpec/Body/e_smp_er.htm"><code>SIMPLE-ERROR</code></a>
  1301. without any additional slots or options.
  1302. </blockquote>
  1303. <p><br>[Condition type]
  1304. <br><a class=none name="ppcre-invocation-error"><b>ppcre-invocation-error</b></a>
  1305. <blockquote><br>
  1306. Errors of type <code>PPCRE-INVOCATION-ERROR</code>
  1307. are signaled if one of the exported functions of CL-PPCRE is called with wrong or
  1308. inconsistent arguments. This is a direct subtype of <a
  1309. href="#ppcre-error"><code>PPCRE-ERROR</code></a> without any
  1310. additional slots or options.
  1311. </blockquote>
  1312. <p><br>[Condition type]
  1313. <br><a class=none name="ppcre-syntax-error"><b>ppcre-syntax-error</b></a>
  1314. <blockquote><br>
  1315. An error of type <code>PPCRE-SYNTAX-ERROR</code> is signaled if
  1316. CL-PPCRE's parser encounters an error when trying to parse a regex
  1317. string or to convert a parse tree into its internal representation.
  1318. This is a direct subtype of <a
  1319. href="#ppcre-error"><code>PPCRE-ERROR</code></a> with two additional
  1320. slots. These denote the regex string which CL-PPCRE was parsing and
  1321. the position within the string where the error occurred. If the error
  1322. happens while CL-PPCRE is converting a parse tree, both of these slots
  1323. contain <code>NIL</code>. (See the next two entries on how to access
  1324. these slots.)
  1325. <p>
  1326. As many syntax errors can't be detected before the parser is at the
  1327. end of the stream, the reported position usually denotes the last position
  1328. where the parser was happy and not the position where it gave up.
  1329. <pre>
  1330. * (handler-case
  1331. (scan &quot;foo**x&quot; &quot;fooox&quot;)
  1332. (ppcre-syntax-error (condition)
  1333. (format t &quot;Houston, we've got a problem with the string ~S:~%~
  1334. Looks like something went wrong at position ~A.~%~
  1335. The last message we received was \&quot;~?\&quot;.&quot;
  1336. (ppcre-syntax-error-string condition)
  1337. (ppcre-syntax-error-pos condition)
  1338. (simple-condition-format-control condition)
  1339. (simple-condition-format-arguments condition))
  1340. (values)))
  1341. Houston, we've got a problem with the string &quot;foo**x&quot;:
  1342. Looks like something went wrong at position 4.
  1343. The last message we received was &quot;Quantifier '*' not allowed.&quot;.
  1344. </pre>
  1345. </blockquote>
  1346. <p><br>[Function]
  1347. <br><a class=none name="ppcre-syntax-error-string"><b>ppcre-syntax-error-string</b></a> <i>condition</i> =&gt; <i>string</i>
  1348. <blockquote><br>
  1349. If <code><i>condition</i></code> is a condition of type <a
  1350. href="#ppcre-syntax-error"><code>PPCRE-SYNTAX-ERROR</code></a>, this
  1351. function will return the string the parser was parsing when the error was
  1352. encountered (or <code>NIL</code> if the error happened while trying to
  1353. convert a parse tree). This might be particularly useful when <a
  1354. href="#*allow-quoting*"><code>*ALLOW-QUOTING*</code></a> is
  1355. <em>true</em> because in this case the offending string might not be the one you gave to the <a
  1356. href="#create-scanner"><code>CREATE-SCANNER</code></a> function.
  1357. </blockquote>
  1358. <p><br>[Function]
  1359. <br><a class=none name="ppcre-syntax-error-pos"><b>ppcre-syntax-error-pos</b></a> <i>condition</i> =&gt; <i>number</i>
  1360. <blockquote><br>
  1361. If <code><i>condition</i></code> is a condition of type <a
  1362. href="#ppcre-syntax-error"><code>PPCRE-SYNTAX-ERROR</code></a>, this
  1363. function will return the position within the string where the error
  1364. occurred (or <code>NIL</code> if the error happened while trying to
  1365. convert a parse tree).
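<p>
The two accessors can be combined into a small validation helper. This is only a sketch: <code>CHECK-REGEX-STRING</code> is a hypothetical name and not part of CL-PPCRE. Going by the transcript in the <a href="#ppcre-syntax-error"><code>PPCRE-SYNTAX-ERROR</code></a> entry, the second call should return <code>NIL</code>, <code>4</code>, and <code>"foo**x"</code>.
<pre>
<font color=orange>;; Hypothetical helper, not part of CL-PPCRE.</font>
(defun check-regex-string (regex-string)
  "Return T if REGEX-STRING can be turned into a scanner.  Otherwise
return NIL, the error position, and the string the parser was working on."
  (handler-case
      (progn
        (create-scanner regex-string)
        t)
    (ppcre-syntax-error (condition)
      (values nil
              (ppcre-syntax-error-pos condition)
              (ppcre-syntax-error-string condition)))))

(check-regex-string "fo+x")     <font color=orange>; first value is T</font>
(check-regex-string "foo**x")   <font color=orange>; first value is NIL</font>
</pre>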
  1366. </blockquote>
  1367. <br>&nbsp;<br><h3><a name="unicode" class=none>Unicode properties</a></h3>
  1368. You can add support for Unicode properties to CL-PPCRE by loading
  1369. the CL-PPCRE-UNICODE system (which depends on <a href="http://weitz.de/cl-unicode/">CL-UNICODE</a>):
  1370. <pre>
  1371. (asdf:oos 'asdf:load-op :cl-ppcre-unicode)
  1372. </pre>
  1373. This will automatically
  1374. install <a href="#unicode-property-resolver"><code>UNICODE-PROPERTY-RESOLVER</code></a>
  1375. as your <a href="#*property-resolver*">property resolver</a>.
  1376. <p>
  1377. See the <a href="http://weitz.de/cl-unicode/">CL-UNICODE</a>
  1378. documentation for information about the supported Unicode properties
  1379. and how they are named.
  1380. <p><br>[Function]<br><a class=none name='unicode-property-resolver'><b>unicode-property-resolver</b> <i>property-name</i> =&gt; <i>function-or-nil</i></a>
  1381. <blockquote><br>
  1382. A <a href="#*property-resolver*">property
  1383. resolver</a> which understands Unicode properties using
  1384. <a href="http://weitz.de/cl-unicode/">CL-UNICODE</a>'s <a href="http://weitz.de/cl-unicode/#property-test"><code>PROPERTY-TEST</code></a>
  1385. function. This resolver is automatically installed
  1386. in <a href="#*property-resolver*"><code>*PROPERTY-RESOLVER*</code></a>
  1387. when the <a href="#unicode">CL-PPCRE-UNICODE</a> system is loaded.
  1388. <pre>
  1389. * (scan-to-strings "\\p{Script:Latin}+" "0+AB_*")
  1390. "AB"
  1391. #()
  1392. </pre>
  1393. Note that this symbol is exported from
  1394. the <code>CL-PPCRE-UNICODE</code> package and not from
  1395. the <code>CL-PPCRE</code> package.
  1396. </blockquote>
  1397. <br>&nbsp;<br><h3><a name="filters" class=none>Filters</a></h3>
  1398. Because several users have asked for it, CL-PPCRE now offers
  1399. &quot;filters&quot; (see <a href="#filterdef">above</a> for syntax)
  1400. which are basically arbitrary, user-defined functions that can act as
  1401. regex building blocks. Filters can only be used within <a
  1402. href="#create-scanner2">parse trees</a>, not within Perl regex
  1403. strings.
  1404. <p>
  1405. A filter is defined by its <em>filter function</em> which must be a
  1406. function of one argument. During the matching process this function
  1407. might be called once or several times or it might not be called at
  1408. all. If it's called, its argument is an integer <code><i>pos</i></code>
  1409. which is the current position within the target string. The filter can
  1410. either return <code>NIL</code> (which means that the subexpression
  1411. represented by this filter didn't match) or an integer not smaller
  1412. than <code><i>pos</i></code> for success. A zero-length assertion
  1413. should return <code><i>pos</i></code> itself while a filter which
  1414. wants to consume <code>N</code> characters should return
  1415. <code>(+&nbsp;POS&nbsp;N)</code>.
  1416. <p>
  1417. If you supply the optional value <code><i>length</i></code> and it is
  1418. not <code>NIL</code>, then this is a promise to the regex engine that
  1419. your filter will <em>always</em> consume <em>exactly</em>
  1420. <code><i>length</i></code> characters. The regex engine might use this
  1421. information for optimization purposes but it is otherwise irrelevant
  1422. to the outcome of the matching process.
  1423. <p>
  1424. The filter function can access the following special variables from
  1425. its code body:
  1426. <dl>
  1427. <dt><code>CL-PPCRE::*STRING*</code></dt>
  1428. <dd>The target (a string) of the current matching process.</dd>
  1429. <dt><code>CL-PPCRE::*START-POS*</code> and
  1430. <code>CL-PPCRE::*END-POS*</code></dt>
  1431. <dd>The start and end (integers) indices
  1432. of the current matching process. These correspond to the
  1433. <code>START</code> and <code>END</code> keyword parameters
  1434. of <a href="#scan"><code>SCAN</code></a>.</dd>
  1435. <dt><code>CL-PPCRE::*REAL-START-POS*</code></dt>
  1436. <dd>The initial starting
  1437. position. This is only relevant for repeated scans (as in <a
  1438. href="#do-scans"><code>DO-SCANS</code></a>) where
  1439. <code>CL-PPCRE::*START-POS*</code> will be moved forward while
  1440. <code>CL-PPCRE::*REAL-START-POS*</code> won't. For normal scans the
  1441. value of this variable is <code>NIL</code>.</dd>
  1442. <dt><CODE>CL-PPCRE::*REG-STARTS*</CODE> and
  1443. <CODE>CL-PPCRE::*REG-ENDS*</CODE></dt>
  1444. <dd>Two simple vectors which denote the
  1445. start and end indices of registers within the regular expression. The
  1446. first register is indexed by&nbsp;0. If a register hasn't matched yet,
  1447. then its corresponding entry in <CODE>CL-PPCRE::*REG-STARTS*</CODE> is
  1448. <code>NIL</code>.</dd>
  1449. </dl>
  1450. These variables should be considered read-only. Do <em>not</em> change
  1451. these values unless you really know what you're doing!
  1452. <p>
  1453. Note that the names of the variables are not exported from the
  1454. <code>CL-PPCRE</code> package because there's no explicit guarantee
  1455. that they will be available in future releases. (Although after so
  1456. many years it is <em>very</em> unlikely that they'll go away...)
  1457. <pre>
  1458. * (defun my-info-filter (pos)
  1459. &quot;Show some info about the matching process.&quot;
  1460. (format t &quot;Called at position ~A~%&quot; pos)
  1461. (loop with dim = (array-dimension cl-ppcre::*reg-starts* 0)
  1462. for i below dim
  1463. for reg-start = (aref cl-ppcre::*reg-starts* i)
  1464. for reg-end = (aref cl-ppcre::*reg-ends* i)
  1465. do (format t &quot;Register ~A is currently &quot; (1+ i))
  1466. when reg-start
  1468. do (write-char #\')
  1469. (write-string cl-ppcre::*string* nil
  1470. :start reg-start :end reg-end)
  1471. (write-char #\')
  1472. else
  1473. do (write-string &quot;unbound&quot;)
  1474. do (terpri))
  1475. (terpri)
  1476. pos)
  1477. MY-INFO-FILTER
  1478. * (scan '(:sequence
  1479. (:register
  1480. (:greedy-repetition 0 nil
  1481. (:char-class (:range #\a #\z))))
  1482. (:filter my-info-filter 0) &quot;X&quot;)
  1483. &quot;bYcdeX&quot;)
  1484. Called at position 1
  1485. Register 1 is currently 'b'
  1486. Called at position 0
  1487. Register 1 is currently ''
  1488. Called at position 1
  1489. Register 1 is currently ''
  1490. Called at position 5
  1491. Register 1 is currently 'cde'
  1492. 2
  1493. 6
  1494. #(2)
  1495. #(5)
  1496. * (scan '(:sequence
  1497. (:register
  1498. (:greedy-repetition 0 nil
  1499. (:char-class (:range #\a #\z))))
  1500. (:filter my-info-filter 0) &quot;X&quot;)
  1501. &quot;bYcdeZ&quot;)
  1502. NIL
  1503. * (defun my-weird-filter (pos)
  1504. &quot;Only match at this point if either pos is odd and the character
  1505. we're looking at is lowercase or if pos is even and the next two
  1506. characters we're looking at are uppercase. Consume these characters if
  1507. there's a match.&quot;
  1508. (format t &quot;Trying at position ~A~%&quot; pos)
  1509. (cond ((and (oddp pos)
  1510. (&lt; pos cl-ppcre::*end-pos*)
  1511. (lower-case-p (char cl-ppcre::*string* pos)))
  1512. (1+ pos))
  1513. ((and (evenp pos)
  1514. (&lt; (1+ pos) cl-ppcre::*end-pos*)
  1515. (upper-case-p (char cl-ppcre::*string* pos))
  1516. (upper-case-p (char cl-ppcre::*string* (1+ pos))))
  1517. (+ pos 2))
  1518. (t nil)))
  1519. MY-WEIRD-FILTER
  1520. * (defparameter *weird-regex*
  1521. `(:sequence &quot;+&quot; (:filter ,#'my-weird-filter) &quot;+&quot;))
  1522. *WEIRD-REGEX*
  1523. * (scan *weird-regex* &quot;+A++a+AA+&quot;)
  1524. Trying at position 1
  1525. Trying at position 3
  1526. Trying at position 4
  1527. Trying at position 6
  1528. 5
  1529. 9
  1530. #()
  1531. #()
  1532. * (fmakunbound 'my-weird-filter)
  1533. MY-WEIRD-FILTER
  1534. * (scan *weird-regex* &quot;+A++a+AA+&quot;)
  1535. Trying at position 1
  1536. Trying at position 3
  1537. Trying at position 4
  1538. Trying at position 6
  1539. 5
  1540. 9
  1541. #()
  1542. #()
  1543. </pre>
  1544. Note that in the second call to <code>SCAN</code> our filter wasn't
  1545. invoked at all - it was optimized away by the regex engine because it
  1546. knew that it couldn't match. Also note that <code>*WEIRD-REGEX*</code>
  1547. still worked after we removed the global function definition of
  1548. <code>MY-WEIRD-FILTER</code> because the regular expression had
  1549. captured the original definition.
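<p>
Here is one more sketch, this time of a filter which makes use of the optional <code><i>length</i></code> promise. The function name <code>DAY-OF-MONTH-FILTER</code> and the <code>"day="</code> prefix are invented for this example. The filter either fails or consumes exactly two characters, so it is safe to supply a length of&nbsp;2.
<pre>
<font color=orange>;; Hypothetical filter, not part of CL-PPCRE.</font>
(defun day-of-month-filter (pos)
  "Consume exactly two digits if they form a number from 01 to 31."
  (let ((end (+ pos 2)))
    (when (and (&lt;= end cl-ppcre::*end-pos*)
               (digit-char-p (char cl-ppcre::*string* pos))
               (digit-char-p (char cl-ppcre::*string* (1+ pos))))
      (let ((day (parse-integer cl-ppcre::*string* :start pos :end end)))
        (when (&lt;= 1 day 31)
          end)))))

<font color=orange>;; The 2 promises that the filter always consumes exactly two characters.</font>
(scan '(:sequence "day=" (:filter day-of-month-filter 2) :end-anchor) "day=17")
(scan '(:sequence "day=" (:filter day-of-month-filter 2) :end-anchor) "day=42")
</pre>
The first call should report a match from 0 to 6, while the second one should return <code>NIL</code> because 42 is rejected by the filter.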
  1550. <p>
  1551. For more ideas about what you can do with filters see <a
  1552. href="http://common-lisp.net/pipermail/cl-ppcre-devel/2004-October/000069.html">this
  1553. thread</a> on the <a href="#mail">mailing list</a>.
  1554. <br>&nbsp;<br><h3><a name="perl" class=none>Compatibility with Perl</a></h3>
  1555. Depending on your Perl version you might encounter a couple of small
  1556. incompatibilities with Perl most of which aren't due to CL-PPCRE:
  1557. <h4><a name="empty" class=none>Empty strings instead of <code>undef</code> in <code>$1</code>, <code>$2</code>, etc.</a></h4>
  1558. (Cf. case #629 of <a href="#test"><code>perltestdata</code></a>.)
  1559. This is <a
  1560. href="http://groups.google.com/groups?threadm=87u1kw8hfr.fsf%40dyn164.dbdmedia.de">a
  1561. bug</a> in Perl 5.6.1 and earlier which has been fixed in 5.8.0.
  1562. <h4><a name="scope" class=none>Strange scoping of embedded modifiers</a></h4>
  1563. (Cf. case #430 of <a href="#test"><code>perltestdata</code></a>.)
  1564. This is <a
  1565. href="http://groups.google.com/groups?threadm=871y80dpqh.fsf%40bird.agharta.de">a
  1566. bug</a> in Perl 5.6.1 and earlier which has been fixed in 5.8.0.
  1567. <h4><a name="inconsistent" class=none>Inconsistent capturing of <code>$1</code>, <code>$2</code>, etc.</a></h4>
  1568. (Cf. case #662 of <a href="#test"><code>perltestdata</code></a>.)
  1569. This is <a
  1570. href="http://bugs6.perl.org/rt2/Ticket/Display.html?id=18708">a
  1571. bug</a> in Perl which hasn't been fixed yet.
  1572. <h4><a name="lookaround" class=none>Captured groups not available outside of look-aheads and look-behinds</a></h4>
  1573. (Cf. case #1439 of <a href="#test"><code>perltestdata</code></a>.)
  1574. Well, OK, this ain't a Perl bug. I just can't quite understand why
  1575. captured groups should only be seen within the scope of a look-ahead
  1576. or look-behind. For the moment, CL-PPCRE and Perl agree to
  1577. disagree... :)
  1578. <h4><a name="order" class=none>Alternations don't always work from left to right</a></h4>
  1579. (Cf. case #790 of <a href="#test"><code>perltestdata</code></a>.) I
  1580. also think this is a Perl bug but I have currently lost the drive to
  1581. report it.
  1582. <h4><a name="uprops" class=none>Different names for Unicode properties</a></h4>
  1583. The names of <a href="#unicode">Unicode properties</a> are derived
  1584. from <a href="http://weitz.de/cl-unicode/">CL-UNICODE</a> and might
  1585. differ slightly from the names in Perl. Most of them should be
  1586. identical, though.
  1587. Also, <a href="http://weitz.de/cl-unicode/">CL-UNICODE</a> is based on
  1588. Unicode&nbsp;5.1 while your installed Perl version might not be.
  1589. <h4><a name="mac" class=none><code>&quot;\r&quot;</code> doesn't work with MCL</a></h4>
  1590. (Cf. case #9 of <a href="#test"><code>perltestdata</code></a>.) For
  1591. some strange reason that I don't understand MCL translates
  1592. <code>#\Return</code> to <code>(CODE-CHAR 10)</code> while MacPerl
  1593. translates <code>&quot;\r&quot;</code> to <code>(CODE-CHAR
  1594. 13)</code>. Hmmm...
  1595. <h4><a name="alpha" class=none>What about <code>&quot;\w&quot;</code>?</a></h4>
  1596. CL-PPCRE uses <a
  1597. href="http://www.lispworks.com/documentation/HyperSpec/Body/f_alphan.htm"><code>ALPHANUMERICP</code></a>
  1598. to decide whether a character matches Perl's
  1599. <code>&quot;\w&quot;</code>, so depending on your CL implementation
  1600. you might encounter differences between Perl and CL-PPCRE when
  1601. matching non-ASCII characters.
  1602. <br>&nbsp;<br><h3><a name="bugs" class=none>Bugs and problems</a></h3>
  1603. <h4><a name="quote" class=none><code>&quot;\Q&quot;</code> doesn't work, or does it?</a></h4>
  1604. In Perl the following code works as expected, i.e. it prints <code>1</code>.
  1605. <pre>
  1606. #!/usr/bin/perl -l
  1607. $a = '\E*';
  1608. print 1
  1609. if '\E*\E*' =~ /(?:\Q$a\E){2}/;
  1610. </pre>
  1611. If you try to do something similar in CL-PPCRE, you get an error:
  1612. <pre>
  1613. * (let ((*allow-quoting* t)
  1614. (a &quot;\\E*&quot;))
  1615. (scan (concatenate 'string &quot;(?:\\Q&quot; a &quot;\\E){2}&quot;) &quot;\\E*\\E*&quot;))
  1616. Quantifier '*' not allowed at position 3 in string &quot;(?:*\\E){2}&quot;
  1617. </pre>
  1618. The error message might give you a hint as to why this happens:
  1619. Because <a href="#*allow-quoting*"><code>*ALLOW-QUOTING*</code></a>
  1620. was <em>true</em> the concatenated string was pre-processed before it
  1621. was fed to CL-PPCRE's parser - the result of this pre-processing is
  1622. <code>&quot;(?:*\\E){2}&quot;</code> because the
  1623. <code>&quot;\\E&quot;</code> in the string <code>A</code> was taken to
  1624. be the end of the quoted section started by
  1625. <code>&quot;\\Q&quot;</code>. This cannot happen in Perl due to its
  1626. complicated interpolation rules - see <code>man&nbsp;perlop</code> for
  1627. the scary details. It <em>can</em> happen in CL-PPCRE, though.
  1628. Bummer!
  1629. <p>
  1630. What gives? <code>&quot;\\Q...\\E&quot;</code> in CL-PPCRE should only
  1631. be used in literal strings. If you want to quote arbitrary strings,
  1632. try <a href="http://weitz.de/cl-interpol/">CL-INTERPOL</a> or use <a
  1633. href="#quote-meta-chars"><code>QUOTE-META-CHARS</code></a>:
  1634. <pre>
  1635. * (let ((a &quot;\\E*&quot;))
  1636. (scan (concatenate 'string &quot;(?:&quot; (quote-meta-chars a) &quot;){2}&quot;) &quot;\\E*\\E*&quot;))
  1637. 0
  1638. 6
  1639. #()
  1640. #()
  1641. </pre>
  1642. Or, even better and Lisp-ier, use the <a href="#create-scanner2">S-expression syntax</a> instead - no need for quoting in this case:
  1643. <pre>
  1644. * (let ((a "\\E*"))
  1645. (scan `(:greedy-repetition 2 2 ,a) "\\E*\\E*"))
  1646. 0
  1647. 6
  1648. #()
  1649. #()
  1650. </pre>
  1651. <h4><a name="backslash" class=none>Backslashes may confuse you...</a></h4>
  1652. <pre>
  1653. * (let ((a &quot;y\\y&quot;))
  1654. (scan a a))
  1655. NIL
  1656. </pre>
  1657. You didn't expect this to yield <code>NIL</code>, did you? Shouldn't something like <code>(SCAN&nbsp;A&nbsp;A)</code> always return a true value? No, because the first and the second argument to <code>SCAN</code> are handled differently: The first argument is fed to CL-PPCRE's parser and is treated like a Perl regular expression. In particular, the parser "sees" <code>\y</code> and converts it to <code>y</code> because <code>\y</code> has no special meaning in regular expressions. So, the regular expression is the constant string <code>"yy"</code>. But the second argument isn't converted - it is left as is, i.e. it's equivalent to Perl's <code>'y\y'</code>. In other words, this example would be equivalent to the Perl code
  1658. <pre>
  1659. 'y\y' =~ /y\y/;
  1660. </pre>
  1661. or to
  1662. <pre>
  1663. $a = 'y\y';
  1664. $a =~ /$a/;
  1665. </pre>
  1666. which should explain why it doesn't match.
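<p>
If what you actually want is to match the contents of <code>A</code> literally, here is a sketch of two ways to do that, using only <a href="#quote-meta-chars"><code>QUOTE-META-CHARS</code></a> and the <a href="#create-scanner2">parse tree syntax</a> described elsewhere in this document. Both forms should report a match from 0 to 3.
<pre>
(let ((a "y\\y"))
  <font color=orange>;; QUOTE-META-CHARS escapes the backslash, so the regex matches A literally</font>
  (scan (quote-meta-chars a) a))

(let ((a "y\\y"))
  <font color=orange>;; strings inside a parse tree are always taken literally</font>
  (scan `(:sequence ,a) a))
</pre>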
  1667. <p>
  1668. Still confused? You might want to try <a href="http://weitz.de/cl-interpol/">CL-INTERPOL</a>.
  1669. <br>&nbsp;<br><h3><a class=none name="allegro">AllegroCL compatibility mode</a></h3>
  1670. Since autumn 2004 <a
  1671. href="http://www.franz.com/products/allegrocl/">AllegroCL</a> offers
  1672. <a
  1673. href="http://www.franz.com/support/documentation/7.0/doc/regexp.htm">a
  1674. new regular expression API</a> with a syntax very similar to
  1675. CL-PPCRE. Although CL-PPCRE is quite fast already, AllegroCL's engine will
  1676. most likely be even faster (but only on AllegroCL, of course). However, you might want to
  1677. stick to CL-PPCRE because you have a "legacy" application or because
  1678. you want your code to be portable to other Lisp implementations.
  1679. Therefore, beginning with version 1.2.0, CL-PPCRE offers a
  1680. "compatibility mode" where you can continue using the CL-PPCRE API as
  1681. described <a href="#dict">above</a> but deploy the AllegroCL regex
  1682. engine under the hood. (The details are: Calls to <a
  1683. href="#create-scanner"><code>CREATE-SCANNER</code></a> and <a
  1684. href="#scan"><code>SCAN</code></a> are dispatched to their AllegroCL
  1685. counterparts <a
  1686. href="http://www.franz.com/support/documentation/7.0/doc/operators/excl/compile-re.htm"><code>EXCL:COMPILE-RE</code></a>
  1687. and <a
  1688. href="http://www.franz.com/support/documentation/7.0/doc/operators/excl/match-re.htm"><code>EXCL:MATCH-RE</code></a>
  1689. while everything else is left as is.)
  1690. <p>
  1691. The advantage of this mode is that you'll get a much smaller image and
  1692. most likely faster code. (But note that CL-PPCRE needs to do a small amount of work to massage AllegroCL's output into the format expected by CL-PPCRE.) The downside is that your code won't be
  1693. fully compatible with CL-PPCRE anymore. Here are some of the
  1694. differences (most of which probably don't matter very often):
  1695. <ul>
  1696. <li>The AllegroCL engine doesn't offer <a
  1697. href="#parse-tree-synonym">parse tree synonyms</a> and <a href="#filters">filters</a>.
  1698. <li>The AllegroCL engine <a href="http://www.franz.com/support/documentation/8.0/doc/regexp.htm#regexp-new-compatibility-2">will choke on some regular expressions involving curly braces</a> that are accepted by Perl and CL-PPCRE's native engine.
  1699. <li>The AllegroCL engine's case-folding mode switch (which is used instead of CL-PPCRE's <a href="#create-scanner"><code>:CASE-INSENSITIVE-MODE</code> keyword parameter</a>) <a href="http://www.franz.com/support/documentation/8.0/doc/regexp.htm#regexp-new-matching-2">is currently only effective for ASCII characters</a>.
  1700. <li>The AllegroCL engine <a href="http://www.franz.com/support/documentation/8.0/doc/regexp.htm#regexp-new-compatibility-2">doesn't support</a> <a href="#*allow-quoting*">quoting of metacharacters</a>.
  1701. <li>In AllegroCL compatibility mode compiled regular expressions (as returned by <a href="#create-scanner"><code>CREATE-SCANNER</code></a>) aren't functions but structures.
  1702. <li>The AllegroCL engine <a href="http://www.franz.com/support/documentation/8.0/doc/regexp.htm#regexp-new-compatibility-2">doesn't support</a> <a href="#*property-resolver*">named properties</a>.
  1703. </ul>
  1704. For more details about the AllegroCL engine and possible deviations from CL-PPCRE see the <a href="http://www.franz.com/support/documentation/8.0/doc/regexp.htm">documentation</a> at the <a href="http://www.franz.com/">Franz Inc. website</a>.
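<p>
The point about compiled regular expressions being structures matters
mostly for code which treats scanners as functions. Here's a minimal
sketch of the difference, assuming CL-PPCRE is already loaded with its
native engine. The variable <code>*DIGITS*</code> is made up for this
illustration.
<pre>
(defparameter *digits* (cl-ppcre:create-scanner "\\d+"))

;; Portable: hand the compiled scanner back to the CL-PPCRE API.
;; With the native engine this returns 3, 6, #(), #().
(cl-ppcre:scan *digits* "abc123")

;; Not portable: this relies on the scanner being a function, which
;; holds for the native engine but not in AllegroCL compatibility mode.
(functionp *digits*)
</pre>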
  1705. <p>
  1706. To use the AllegroCL compatibility mode you have to
  1707. <pre>
  1708. (push :use-acl-regexp2-engine *features*)
  1709. </pre>
  1710. <em>before</em> you compile CL-PPCRE.
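<p>
As a sketch of what this can look like in practice, assuming that you
build CL-PPCRE with ASDF (which is an assumption of this example and
not a requirement), you could put something like the following into
your init file or build script:
<pre>
;; Only meaningful on AllegroCL, other Lisps skip the whole form.
#+:allegro
(progn
  (pushnew :use-acl-regexp2-engine *features*)
  ;; :FORCE T recompiles CL-PPCRE in case it had already been
  ;; compiled without the feature.
  (asdf:load-system :cl-ppcre :force t))
</pre>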
  1711. <br>&nbsp;<br><h3><a class=none name="blabla">Hints, comments, performance considerations</a></h3>
  1712. Here are, in no particular order, a couple of things about CL-PPCRE
  1713. and regular expressions in general that you might or might not want to
  1714. read.
  1715. <ul>
  1716. <li>A lot of hackers (especially users of Perl and other scripting
  1717. languages) think that regular expressions are the greatest thing
  1718. since sliced bread and use them for almost everything. That is just
  1719. plain wrong. Other hackers (especially Lispers) tend to think that
  1720. regular expressions are the work of the devil and try to avoid them
  1721. at all cost. That's also wrong. Regular expressions are a handy
  1722. and useful addition to your toolkit which you should use when
  1723. appropriate - you should just try to figure out first <em>if</em>
  1724. they're appropriate for the task at hand.
  1725. <li>If you're concerned about the string syntax of regular
  1726. expressions which can look like line noise and is really hard to
  1727. read for long expressions, consider using
  1728. CL-PPCRE's <a href="#create-scanner2">S-expression syntax</a>
  1729. instead. It is less error-prone and you don't have to worry about
  1730. escaping characters. It is also easier to manipulate
  1731. programmatically.
  1732. <li>For alternations, order is important. The regex engine tries the
  1733. alternatives from left to right and settles for the first one that
  1734. leads to a match, even if a later alternative would match more.
  1735. <pre>
  1736. CL-USER 1 > (scan-to-strings "<=|<" "<=")
  1737. "<="
  1738. #()
  1739. CL-USER 2 > (scan-to-strings "<|<=" "<=")
  1740. "<"
  1741. #()
  1742. </pre>
  1743. <li><a class=none name="compiler-macro">CL-PPCRE</a>
  1744. uses <a href="http://www.lispworks.com/documentation/HyperSpec/Body/03_bba.htm">compiler
  1745. macros</a> to pre-compile scanners
  1746. at <a href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_l.htm#load_time">load
  1747. time</a> if possible. This happens if the compiler can determine
  1748. that the regular expression (no matter if it's a string or an
  1749. S-expression)
  1750. is <a href="http://www.lispworks.com/documentation/HyperSpec/Body/f_consta.htm">constant</a>
  1751. at <a href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_c.htm#compile_time">compile
  1752. time</a> and is intended to save the time for creating scanners
  1753. at <a href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_e.htm#execution_time">execution
  1754. time</a> (probably creating the same scanner over and over in a
  1755. loop). Make sure you don't prevent the compiler from helping you.
  1756. For example, a definition like this one is usually not a good idea:
  1757. <pre>
  1758. (defun regex-match (regex target)
  1759.   <font color=orange>;; don't do that!</font>
  1760.   (scan regex target))
  1761. </pre>
  1762. <li>If you want to search for a substring in a large string or if
  1763. you search for the same string very
  1764. often, <a href="#scan"><code>SCAN</code></a> will usually be faster
  1765. than Common
  1766. Lisp's <a href="http://www.lispworks.com/documentation/HyperSpec/Body/f_search.htm"><code>SEARCH</code></a>
  1767. if you <a href="#*use-bmh-matchers*">use BMH matchers</a>. However,
  1768. this only makes sense if scanner creation time is not the
  1769. limiting factor, i.e. if the search target is <em>very</em> large or
  1770. if you're using the same scanner very often.
  1771. <li>Complementary to the last hint, <em>don't</em> use regular
  1772. expressions for one-time searches for constant strings. That's a
  1773. terrible waste of resources.
  1774. <li><a href="#*use-bmh-matchers*"><code>*USE-BMH-MATCHERS*</code></a> together with a large value for
  1775. <a href="#*regex-char-code-limit*"><code>*REGEX-CHAR-CODE-LIMIT*</code></a>
  1776. can lead to huge scanners.
  1777. <li>A character class is by default translated into a sequence of
  1778. tests exactly as you might expect. For
  1779. example, <code>"[af-l\\d]"</code> means to test if the character is
  1780. equal to <code>#\a</code>, then to test if it's
  1781. between <code>#\f</code> and <code>#\l</code>, then if it's a digit.
  1782. There's by default no attempt to remove redundancy (as
  1783. in <code>"[a-ge-kf]"</code>) or to otherwise optimize these tests
  1784. for speed. However, you can play
  1785. with <a href="#*optimize-char-classes*"><code>*OPTIMIZE-CHAR-CLASSES*</code></a>
  1786. if you've identified character classes as a bottleneck and want to
  1787. make sure that you have <em>O(1)</em> test functions.
  1788. <li>If you know that the expression you're looking for is anchored,
  1789. use anchors in your regex. This can help the engine a lot in making
  1790. your scanners more efficient.
  1791. <li>In addition to anchors, constant strings at the start or end of a
  1792. regular expression can help the engine to quickly scan a string.
  1793. Note that for example <code>"(abcd|abef)"</code>
  1794. and <code>"ab(cd|ef)"</code> are equivalent, but only the second
  1795. form has a constant start the regex engine can recognize.
  1796. <li>Try to avoid alternations if possible or at least factor them
  1797. out as in the example above.
  1798. <li>If neither anchors nor constant strings are in sight, maybe
  1799. "standalone" (sometimes also called "possessive") regular
  1800. expressions can be helpful. Try the following:
  1801. <pre>
  1802. (let ((target (make-string 10000 :initial-element #\a))
  1803.       (scanner-1 (create-scanner "a*\\d"))
  1804.       (scanner-2 (create-scanner "(?>a*)\\d")))
  1805.   (time (scan scanner-1 target))
  1806.   (time (scan scanner-2 target)))
  1807. </pre>
  1808. <li>Consider using <a href="#create-scanner">"single-line mode"</a>
  1809. if it makes sense for your task. By default (following Perl's
  1810. practice), a dot means to search for any character <em>except</em>
  1811. line breaks. In single-line mode a dot searches for <em>any</em>
  1812. character which in some cases means that large parts of the target
  1813. can actually be skipped. This can be vastly more efficient for
  1814. large targets.
  1815. <li>Don't use capturing register groups where a non-capturing group
  1816. would do, i.e. <em>only</em> use registers if you need to refer to
  1817. them later. If you use a register, each scan process needs to
  1818. allocate space for it and update its contents (possibly many times)
  1819. until it's finished. (In Perl parlance - use <code>"(?:foo)"</code> instead of
  1820. <code>"(foo)"</code> whenever possible.)
  1821. <li>In addition to what has been said in the last hint, note that
  1822. Perl semantics force the regex engine to report the <em>last</em>
  1823. match for each register. This implies for example
  1824. that <code>"([a-c])+"</code> and <code>"[a-c]*([a-c])"</code> have
  1825. exactly the same semantics but completely different performance
  1826. characteristics. (Actually, in some cases CL-PPCRE automatically
  1827. converts expressions from the first type into the second type.
  1828. That's not always possible, though, and you shouldn't rely on it.)
  1829. <li>By default, repetitions are "greedy" in Perl (and thus in
  1830. CL-PPCRE). This has an impact on performance and also on the actual
  1831. outcome of a scan. Look at your repetitions and ponder if a greedy
  1832. repetition is really what you want.
  1833. </ul>
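<p>
Some of the hints above can be tried out at the REPL. The following is
only a sketch and assumes that CL-PPCRE is already loaded. The
names <code>*NUMBER-SCANNER*</code>, <code>FIRST-NUMBER</code>,
and <code>*FACTORED-TREE*</code> are made up for this illustration.
<pre>
;; Build the scanner once and reuse it instead of handing a regex
;; which is not constant at compile time to SCAN over and over.
(defparameter *number-scanner* (cl-ppcre:create-scanner "\\d+")
  "Hypothetical example scanner which matches a run of digits.")

(defun first-number (target)
  "Returns the first run of digits in TARGET as a string, or NIL."
  (values (cl-ppcre:scan-to-strings *number-scanner* target)))

(first-number "abc 123 def 456")                       ; "123"

;; S-expression syntax for "ab(?:cd|ef)" with the alternation
;; factored out so that the regex has a constant start.
(defparameter *factored-tree*
  '(:sequence "ab" (:alternation "cd" "ef")))

(cl-ppcre:scan-to-strings *factored-tree* "xxabefxx")  ; "abef", #()

;; A register is only filled if you ask for one.
(cl-ppcre:scan-to-strings "(foo)+bar" "foofoobar")     ; "foofoobar", #("foo")
(cl-ppcre:scan-to-strings "(?:foo)+bar" "foofoobar")   ; "foofoobar", #()

;; Greedy and non-greedy repetitions can yield different matches.
(cl-ppcre:scan-to-strings "a.*b" "aXbXb")              ; "aXbXb"
(cl-ppcre:scan-to-strings "a.*?b" "aXbXb")             ; "aXb"
</pre>
Whether any of these variants is actually faster for your data is
something you should measure
with <a href="http://www.lispworks.com/documentation/HyperSpec/Body/m_time.htm"><code>TIME</code></a>
rather than assume.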
  1834. <br>&nbsp;<br><h3><a class=none name="ack">Acknowledgements</a></h3>
  1835. Although I didn't use their code, I was heavily inspired by looking at
  1836. the Scheme/CL regex implementations of <a
  1837. href="http://www.ccs.neu.edu/home/dorai/pregexp/pregexp.html">Dorai
  1838. Sitaram</a> and <a
  1839. href="http://www.geocities.com/mparker762/clawk#regex">Michael
  1840. Parker</a>. Also, the nice folks from CMUCL's <a
  1841. href="http://www.cons.org/cmucl/support.html">mailing list</a> as well
  1842. as the output of Perl's <code>use re &quot;debug&quot;</code> pragma
  1843. have been very helpful in optimizing the scanners created by CL-PPCRE.
  1844. <p>The list of people who participated in this project in one way or
  1845. the other has grown too long to maintain here. See
  1846. the <a href="http://weitz.de/cl-ppcre/CHANGELOG">ChangeLog</a> for all
  1847. the people who helped with patches, bug reports, or in other ways.
  1848. Thanks to all of them!
  1849. <p>
  1850. Thanks to the guys at
  1851. &quot;<a href="http://www.weinhandel-ottensen.de/">Caf&eacute;
  1852. Ol&eacute;</a>&quot;
  1853. in <a href="http://en.wikipedia.org/wiki/Hamburg">Hamburg</a> where I
  1854. wrote most of the 0.1.0&nbsp;release and thanks to my wife for lending
  1855. me her PowerBook to test early versions of CL-PPCRE with MCL and
  1856. OpenMCL.
  1857. <p>
  1858. $Header: /usr/local/cvsrep/cl-ppcre/doc/index.html,v 1.200 2009/10/28 07:36:31 edi Exp $
  1859. <p><a href="http://weitz.de/index.html">BACK TO MY HOMEPAGE</a>
  1860. </body>
  1861. </html>