/lib/antlr-2.7.5/doc/lexer.html

https://github.com/boo/boo-lang
Possible License(s): GPL-2.0

  1. <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
  2. <html>
  3. <head>
  4. <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  5. <title>Lexical Analysis with ANTLR</title>
  6. </head>
  7. <body bgcolor="#FFFFFF">
  8. <h2><a id="Lexical_Analysis_with_ANTLR" name="Lexical_Analysis_with_ANTLR"></a><a name="_bb1">Lexical Analysis with ANTLR</a></h2>
  9. <p>
  10. A <em>lexer</em> (often called a scanner) breaks up an input stream of characters into vocabulary symbols for a parser, which applies a grammatical structure to that symbol stream. Because ANTLR employs the same recognition mechanism for lexing, parsing, and tree parsing, ANTLR-generated lexers are much stronger than DFA-based lexers such as those generated by DLG (from PCCTS 1.33) and lex.
  11. </p>
  12. <p>
  13. The increase in lexing power comes at the cost of some inconvenience in lexer specification and, indeed, requires a serious shift in how you think about lexical analysis. See a <a href="lexer.html#dfacompare">comparison of LL(k) and DFA-based lexical analysis</a>.
  14. </p>
  15. <p>
  16. ANTLR generates predicated-LL(k) lexers, which means that you can have semantic and syntactic predicates and use k&gt;1 lookahead. The other advantages are:
  17. <ul>
  18. <li>
  19. You can actually read and debug the output, as it is very similar to what you would build by hand.
  20. </li>
  21. <li>
  22. The syntax for specifying lexical structure is the same for lexers, parsers, and tree parsers.
  23. </li>
  24. <li>
  25. You can have actions executed during the recognition of a single token.
  26. </li>
  27. <li>
  28. You can recognize complicated tokens such as HTML tags or &quot;executable&quot; comments like the javadoc <font face="Courier New">@</font>-tags inside <font size="2" face="Courier New">/** ... */</font> comments. The lexer has a stack, unlike a DFA, so you can match nested structures such as nested comments.
  29. </li>
  30. </ul>
  31. <p>
  32. The overall structure of a lexer is:
  33. </p>
  34. <pre>class MyLexer extends Lexer;
  35. options {
  36. <em>some options</em>
  37. }
  38. {<em>
  39. lexer class members</em>
  40. }
  41. <em>lexical rules</em></pre>
  42. <h3><a name="_bb2"></a><a name="lexicalrules">Lexical Rules</a></h3>
  43. <p>
  44. Rules defined within a lexer grammar must have a name beginning with an uppercase letter. These rules implicitly match characters on the input stream instead of tokens on the token stream. Referenced grammar elements include token references (implicit lexer rule references), characters, and strings. Lexer rules are processed in the exact same manner as parser rules and, hence, may specify arguments and return values; further, lexer rules can also have local variables and use recursion. The following rule defines a rule called <font size="2" face="Courier New">ID</font> that is available as a token type in the parser.
  45. </p>
  46. <pre>ID : ( 'a'..'z' )+
  47. ;</pre>
  48. <p>
  49. This rule would become part of the resulting lexer and would appear as a method called <font size="2" face="Courier New">mID()</font> that looks sort of like this:
  50. <tt><pre>
  51. public final void mID(...)
  52. throws RecognitionException,
  53. CharStreamException, TokenStreamException
  54. {
  55. ...
  56. _loop3:
  57. do {
  58. if (((LA(1) >= 'a' && LA(1) <= 'z'))) {
  59. matchRange('a','z');
  60. }
  61. } while (...);
  62. ...
  63. }
  64. </pre></tt>
  65. <p>
  66. It is a good idea to become familiar with ANTLR's output--the generated lexers are human-readable and make a lot of concepts more transparent.
  67. <h4><a id="Skipping_characters" name="Skipping_characters"></a><a name="_bb4">Skipping characters</a></h4>
  68. <p>
  69. To have the characters matched by a rule ignored, set the token type to<font size="2" face="Courier New"> Token.SKIP</font>. For example,
  70. </p>
  71. <pre>WS : ( ' ' | '\t' | '\n' { newline(); } | '\r' )+
  72. { $setType(Token.SKIP); }
  73. ;</pre>
  74. Skipped tokens force the lexer to reset and try for another
  75. token. Skipped tokens are never sent back to the parser.
  76. <h4><a id="Distinguishing_between_lexer_rules" name="Distinguishing_between_lexer_rules"></a><a name="_bb5">Distinguishing between lexer rules</a></h4>
  77. <p>As with most lexer generators like <tt>lex</tt>, you simply list a
  78. set of lexical rules that match tokens. The tool then automatically
  79. generates code to map the next input character(s) to a rule likely to
  80. match. Because ANTLR generates recursive-descent lexers just like it
  81. does for parsers and tree parsers, ANTLR automatically generates a
  82. method for a fictitious rule called <tt>nextToken</tt> that predicts
  83. which of your lexer rules will match upon seeing the character
  84. lookahead. You can think of this method as just a big "switch" that
  85. routes recognition flow to the appropriate rule (the code may be much
  86. more complicated than a simple <tt>switch</tt>-statement, however).
  87. Method <tt>nextToken</tt> is the only method of <tt>TokenStream</tt>
  88. (in Java):
  89. <tt><pre>
  90. public interface TokenStream {
  91. public Token nextToken() throws TokenStreamException;
  92. }
  93. </pre></tt>
  94. A parser feeds off a lookahead buffer and the buffer pulls from any
  95. <tt>TokenStream</tt>.
  96. Consider the following two ANTLR lexer rules:
  97. <tt><pre>
  98. INT : ('0'..'9')+;
  99. WS : ' ' | '\t' | '\r' | '\n';
  100. </pre></tt>
  101. <p>
  102. You will see something like the following method in a lexer generated by ANTLR:
  104. <tt><pre>
  105. public Token nextToken() throws TokenStreamException {
  106. ...
  107. for (;;) {
  108. Token _token = null;
  109. int _ttype = Token.INVALID_TYPE;
  110. resetText();
  111. ...
  112. switch (LA(1)) {
  113. case '0': case '1': case '2': case '3':
  114. case '4': case '5': case '6': case '7':
  115. case '8': case '9':
  116. mINT(); break;
  117. case '\t': case '\n': case '\r': case ' ':
  118. mWS(); break;
  119. default: // error
  120. }
  121. ...
  122. }
  123. }
  124. </pre></tt>
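<p>
The dispatch loop can also be illustrated with a tiny, self-contained Java lexer. This is a hypothetical sketch (class <tt>SimpleLexer</tt> and its helpers are invented for illustration; this is not ANTLR's generated code): <tt>nextToken</tt> routes to <tt>mINT</tt> or <tt>mWS</tt> based on <tt>LA(1)</tt> and retries after a skipped token.
</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the nextToken dispatch loop (not ANTLR's generated code).
public class SimpleLexer {
    private final String in;
    private int p = 0;            // current position in the character stream

    public SimpleLexer(String in) { this.in = in; }

    private char LA(int k) {      // k-symbol lookahead; '\uFFFF' signals EOF here
        int i = p + k - 1;
        return i < in.length() ? in.charAt(i) : '\uFFFF';
    }

    // Mirrors the generated nextToken(): dispatch on LA(1); skipped tokens
    // (whitespace) make the loop reset and try for another token.
    public String nextToken() {
        for (;;) {
            char c = LA(1);
            if (c == '\uFFFF') return null;            // end of input
            if (c >= '0' && c <= '9') return mINT();   // INT : ('0'..'9')+
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') { mWS(); continue; }
            throw new IllegalStateException("no rule matches '" + c + "'");
        }
    }

    private String mINT() {       // consume a run of digits and return its text
        int start = p;
        while (LA(1) >= '0' && LA(1) <= '9') p++;
        return in.substring(start, p);
    }

    private void mWS() { p++; }   // consume one whitespace char; token is skipped

    public static List<String> tokenize(String s) {
        SimpleLexer lx = new SimpleLexer(s);
        List<String> out = new ArrayList<>();
        for (String t = lx.nextToken(); t != null; t = lx.nextToken()) out.add(t);
        return out;
    }
}
```

<p>
Feeding it <tt>"12 34"</tt> yields the INT texts with the whitespace silently consumed, just as <tt>Token.SKIP</tt> does in a real lexer.
</p>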
  125. <p> <b>What happens when the same character predicts more than a single
  126. lexical rule</b>? ANTLR generates a nondeterminism warning between the
  127. offending rules, indicating you need to make sure your rules do not
  128. have common left-prefixes. ANTLR does not follow the common lexer
  129. rule of &quot;first definition wins&quot; (the alternatives within a
  130. rule, however, still follow this rule). Instead, sufficient power is
  131. given to handle the two most common cases of ambiguity, namely
  132. &quot;keywords vs. identifiers&quot;, and &quot;common prefixes&quot;;
  133. and for especially nasty cases you can use syntactic or semantic
  134. predicates.</p>
  135. <p> <b>What if you want to break up the definition of a complicated
  136. rule into multiple rules</b>? Surely you don't want every rule to
  137. result in a complete Token object in this case. Some rules are only
  138. around to help other rules construct tokens. To distinguish these
  139. "helper" rules from rules that result in tokens, use the
  140. <tt>protected</tt> modifier. This overloading of the access-visibility
  141. Java term occurs because if the rule is not visible, it cannot be
  142. "seen" by the parser (yes, this nomeclature sucks). See also <a
  143. href="http://www.jguru.com/faq/view.jsp?EID=125"><b>What is a
  144. "protected" lexer rule</b></a>.
  145. <p>
  146. Another, more practical, way to look at this is to note that only
  147. non-protected rules get called by <tt>nextToken</tt> and, hence, only
  148. non-protected rules can generate tokens that get shoved down the
  149. TokenStream pipe to the parser.
  150. <h4><a id="Return_values" name="Return_values"></a><a name="_bb3">Return values</a></h4>
  151. <p>
  152. All rules return a token object (conceptually) automatically, which contains the text matched for the rule and its token type at least.&nbsp; To specify a user-defined return value, define a return value and set it in an action:
  153. </p>
  154. <pre>protected
  155. INT returns [int v]
  156. : ('0'..'9')+ { v=Integer.parseInt($getText); }
  157. ;</pre>
  158. <p>
  159. Note that only protected rules can have a return type since regular lexer rules generally are invoked by <tt>nextToken()</tt> and the parser cannot access the return value, leading to confusion.
  160. </p>
  161. <h3><a id="Predicated-LL(k)_Lexing" name="Predicated-LL(k)_Lexing"></a><a name="_predllk">Predicated-LL(k) Lexing</a></h3>
  162. <p>
  164. Lexer rules allow your parser to match <i>context-free</i> structures on the input character stream as opposed to the much weaker <i>regular</i> structures (using a DFA--deterministic finite automaton). For example, consider that matching nested curly braces with a DFA must be done using a counter whereas nested curlies are trivially matched with a context-free grammar:
  165. </p>
  166. <pre><tt>ACTION
  167. : '{' ( ACTION | ~'}' )* '}'
  168. ;</tt> </pre>
  169. <p>
  170. The recursion from rule ACTION to ACTION, of course, is the dead giveaway that this is not an ordinary lexer rule.
  171. </p>
  172. <p>
  173. Because the same algorithms are used to analyze lexer and parser rules, lexer rules may use more than a single symbol of lookahead, can use semantic predicates, and can specify syntactic predicates to look arbitrarily ahead, thus, providing recognition capabilities beyond the LL(k) languages into the <i>context-sensitive</i>. Here is a simple example that requires k&gt;1 lookahead:
  174. </p>
  175. <pre><tt>ESCAPE_CHAR
  176. : '\\' 't' // two char of lookahead needed,
  177. | '\\' 'n' // due to common left-prefix
  178. ;</tt> </pre>
  179. <p>
  180. To illustrate the use of syntactic predicates for lexer rules, consider the problem of distinguishing between floating point numbers and ranges in Pascal. Input <tt>3..4</tt> must be broken up into 3 tokens: <tt>INT</tt>, <tt>RANGE</tt>, followed by <tt>INT</tt>. Input <tt>3.4</tt>, on the other hand, must be sent to the parser as a <tt>REAL</tt>. The trouble is that the series of digits before the first <tt>'.'</tt> can be arbitrarily long. The scanner then must consume the first <tt>'.'</tt> to see if the next character is a <tt>'.'</tt>, which would imply that it must back up and consider the first series of digits an integer. Using a non-backtracking lexer makes this task very difficult; without backtracking, your lexer has to be able to respond with more than a single token at one time. However, a syntactic predicate can be used to specify what arbitrary lookahead is necessary:
  181. </p>
  182. <pre><tt>class Pascal extends Parser;
  183. prog: INT
  184. ( RANGE INT
  185. { System.out.println(&quot;INT .. INT&quot;); }
  186. | EOF
  187. { System.out.println(&quot;plain old INT&quot;); }
  188. )
  189. | REAL { System.out.println(&quot;token REAL&quot;); }
  190. ;
  191. class LexPascal extends Lexer;
  192. WS : (' '
  193. | '\t'
  194. | '\n'
  195. | '\r')+
  196. { $setType(Token.SKIP); }
  197. ;
  198. protected
  199. INT : ('0'..'9')+
  200. ;
  201. protected
  202. REAL: INT '.' INT
  203. ;
  204. RANGE
  205. : &quot;..&quot;
  206. ;
  207. RANGE_OR_INT
  208. : ( INT &quot;..&quot; ) =&gt; INT { $setType(INT); }
  209. | ( INT '.' ) =&gt; REAL { $setType(REAL); }
  210. | INT { $setType(INT); }
  211. ;</tt> </pre>
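<p>
The <tt>( INT &quot;..&quot; )=&gt;</tt> predicate amounts to a guess-then-rewind step. The following is a hypothetical, self-contained Java sketch of that mechanism (the <tt>RangeOrInt</tt> class and its method names are invented for illustration, not ANTLR output): mark the position, speculatively look for <tt>INT &quot;..&quot;</tt>, rewind, and then match the winning alternative for real.
</p>

```java
// Hypothetical sketch of the (INT "..")=> syntactic predicate via mark()/rewind().
public class RangeOrInt {
    private final String in;
    private int p = 0;

    public RangeOrInt(String in) { this.in = in; }

    private int mark() { return p; }        // remember position before guessing
    private void rewind(int m) { p = m; }   // undo all consumption since mark

    private boolean matchInt() {            // INT : ('0'..'9')+
        int start = p;
        while (p < in.length() && Character.isDigit(in.charAt(p))) p++;
        return p > start;
    }

    private boolean looking(String s) {     // does s appear at the cursor?
        return in.startsWith(s, p);
    }

    // Decide the token type at the cursor, like rule RANGE_OR_INT.
    public String classify() {
        int m = mark();
        boolean intThenRange = matchInt() && looking("..");  // guess ( INT ".." )
        rewind(m);
        if (intThenRange) { matchInt(); return "INT"; }      // leave ".." for RANGE
        boolean real = matchInt() && looking(".") && !looking(".."); // guess ( INT '.' )
        rewind(m);
        if (real) { matchInt(); p++; matchInt(); return "REAL"; }
        matchInt();
        return "INT";
    }

    public static String first(String s) { return new RangeOrInt(s).classify(); }
}
```

<p>
With this scheme, <tt>3..4</tt> classifies its first token as <tt>INT</tt> (leaving <tt>..</tt> unconsumed), while <tt>3.4</tt> classifies as <tt>REAL</tt>.
</p>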
  212. <p>
  213. ANTLR lexer rules are even able to handle FORTRAN assignments and other difficult lexical constructs. Consider the following <tt>DO</tt> loop:
  214. </p>
  215. <pre><tt>DO 100 I = 1,10</tt></pre>
  216. <p>
  217. If the comma were replaced with a period, the loop would become an assignment to a weird variable called &quot;<tt>DO100I</tt>&quot;:
  218. </p>
  219. <pre><tt>DO 100 I = 1.10</tt></pre>
  220. <p>
  221. The following rules correctly differentiate the two cases:
  222. </p>
  223. <pre>DO_OR_VAR
  224. : (DO_HEADER)=&gt; &quot;DO&quot; { $setType(DO); }
  225. | VARIABLE { $setType(VARIABLE); }
  226. ;
  227. protected
  228. DO_HEADER
  229. options { ignore=WS; }
  230. : &quot;DO&quot; INT VARIABLE '=' EXPR ','
  231. ;
  232. protected INT : ('0'..'9')+;
  233. protected WS : ' ';
  234. protected
  235. VARIABLE
  236. : 'A'..'Z'
  237. ('A'..'Z' | ' ' | '0'..'9')*
  238. { /* strip space from end */ }
  239. ;
  240. // just an int or float
  241. protected EXPR
  242. : INT ( '.' (INT)? )?
  243. ;
  244. </pre>
  245. <p> The previous examples discuss differentiating lexical rules via
  246. lots of lookahead (fixed k or arbitrary). There are other situations
  247. where you have to turn on and off certain lexical rules (making
  248. certain tokens valid and invalid) depending on prior context or
  249. semantic information. One of the best examples is matching a token
  250. only if it starts on the left edge of a line (i.e., column 1).
  251. Without being able to test the state of the lexer's column counter,
  252. you cannot do a decent job. Here is a simple <tt>DEFINE</tt> rule
  253. that is only matched if the semantic predicate is true.
  254. <tt><pre>
  255. DEFINE
  256. : {getColumn()==1}? "#define" ID
  257. ;
  258. </pre></tt>
  259. <p> Semantic predicates on the <b>left-edge</b> of
  260. <b>single-alternative</b> lexical rules get hoisted into the
  261. <tt>nextToken</tt> prediction mechanism. Adding the predicate to a
  262. rule makes it so that it is not a candidate for recognition until the
  263. predicate evaluates to true. In this case, the method for
  264. <tt>DEFINE</tt> would never be entered, even if the lookahead
  265. predicted <tt>#define</tt>, if the column &gt; 1.
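<p>
As a minimal sketch of this hoisting (a hypothetical helper class, not ANTLR's generated code), the predicate is consulted during prediction, before the rule body is ever entered:
</p>

```java
// Hypothetical sketch of a hoisted semantic predicate.
public class ColumnPredicate {
    // Mimics: DEFINE : {getColumn()==1}? "#define" ID ;
    // The predicate is tested during prediction; if it is false, the rule's
    // method is never entered, even when the lookahead says "#define".
    public static String tokenFor(String lineText, int column) {
        if (column == 1 && lineText.startsWith("#define")) {
            return "DEFINE";   // predicate true and lookahead matches
        }
        return "OTHER";        // rule is not a candidate; some other rule matches
    }
}
```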
  266. <p> Another useful example involves context-sensitive recognition such
  267. as when you want to match a token only if your lexer is in a particular
  268. context (e.g., the lexer previously matched some trigger sequence). If
  269. you are matching tokens that separate rows of data such as
  270. "<tt>----</tt>", you probably only want to match this if the "begin
  271. table" sequence has been found.
  272. <tt><pre>
  273. BEGIN_TABLE
  274. : '[' {this.inTable=true;} // enter table context
  275. ;
  276. ROW_SEP
  277. : {this.inTable}? "----"
  278. ;
  279. END_TABLE
  280. : ']' {this.inTable=false;} // exit table context
  281. ;
  282. </pre></tt>
  283. This predicate hoisting ability is another way to simulate lexical
  284. states from DFA-based lexer generators like <tt>lex</tt>, though
  285. predicates are much more powerful. (You could even turn on certain
  286. rules according to the phase of the moon). ;)
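<p>
The table-context idea can be sketched in plain Java (a hypothetical <tt>TableLexer</tt> class, invented for illustration): a boolean member plays the role of the predicate, so <tt>&quot;----&quot;</tt> is only recognized as a row separator between <tt>'['</tt> and <tt>']'</tt>.
</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of predicate-gated rules: ROW_SEP is valid only inside
// the [ ... ] table context, mimicking BEGIN_TABLE / ROW_SEP / END_TABLE.
public class TableLexer {
    private boolean inTable = false;   // the "semantic predicate" state

    public List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        int p = 0;
        while (p < in.length()) {
            char c = in.charAt(p);
            if (c == '[') { inTable = true; out.add("BEGIN_TABLE"); p++; }
            else if (c == ']') { inTable = false; out.add("END_TABLE"); p++; }
            else if (inTable && in.startsWith("----", p)) { out.add("ROW_SEP"); p += 4; }
            else { out.add("CHAR"); p++; }   // anything else: a plain character token
        }
        return out;
    }
}
```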
  287. <h3><a id="Keywords_and_literals" name="Keywords_and_literals"></a><a name="_bb7">Keywords and literals</a></h3>
  288. <p>
  289. Many languages have a general &quot;identifier&quot; lexical rule, and keywords that are special cases of the identifier pattern. A typical identifier token is defined as:
  290. </p>
  291. <pre><tt>ID : LETTER (LETTER | DIGIT)*;</tt></pre>
  292. <p>
  293. This is often in conflict with keywords. ANTLR solves this problem by letting you put fixed keywords into a literals table. The literals table (which is usually implemented as a hash table in the lexer) is checked after each token is matched, so that the literals effectively override the more general identifier pattern. Literals are created in one of two ways. First, any double-quoted string used in a parser is automatically entered into the literals table of the associated lexer. Second, literals may be specified in the lexer grammar by means of the <a href="options.html#literal">literal option</a>. In addition, the <a href="options.html#testLiterals">testLiterals option</a> gives you fine-grained control over the generation of literal-testing code.
  294. </p>
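<p>
The post-match check is simple; as a hypothetical sketch (class and entries invented for illustration): after the general <tt>ID</tt> rule matches, the text is looked up in a hash table, and a hit overrides the token type.
</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the literals-table check performed after matching ID.
public class LiteralsTable {
    private static final Map<String, String> LITERALS = new HashMap<>();
    static {
        LITERALS.put("while", "WHILE");   // entries normally come from the
        LITERALS.put("if", "IF");         // double-quoted strings in the parser
    }

    // Consult the table after matching an identifier; a known keyword
    // overrides the general ID token type.
    public static String tokenType(String matchedText) {
        return LITERALS.getOrDefault(matchedText, "ID");
    }
}
```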
  295. <h3><a id="Common_prefixes" name="Common_prefixes"></a><a name="_bb8">Common prefixes</a></h3>
  296. <p>
  297. Fixed-length common prefixes in lexer rules are best handled by increasing the <a href="options.html#k">lookahead depth</a> of the lexer. For example, some operators from Java:
  298. </p>
  299. <pre><tt>class MyLexer extends Lexer;
  300. options {
  301. k=4;
  302. }
  303. GT : &quot;&gt;&quot;;
  304. GE : &quot;&gt;=&quot;;
  305. RSHIFT : &quot;&gt;&gt;&quot;;
  306. RSHIFT_ASSIGN : &quot;&gt;&gt;=&quot;;
  307. UNSIGNED_RSHIFT : &quot;&gt;&gt;&gt;&quot;;
  308. UNSIGNED_RSHIFT_ASSIGN : &quot;&gt;&gt;&gt;=&quot;;</tt></pre>
  309. <h3><a id="Token_definition_files" name="Token_definition_files"></a><a name="_bb9">Token definition files</a></h3>
  310. <p>
  311. Token definitions can be transferred from one grammar to another by way of token definition files. This is accomplished using the <a href="options.html#importVocab">importVocab</a> and <a href="options.html#exportVocab">exportVocab</a> options.
  312. </p>
  313. <h3><a id="Character_classes" name="Character_classes"></a><a name="_bb10">Character classes</a></h3>
  314. <p>
  315. Use the <font face="Courier New">~</font> operator to invert a character or set of characters.&nbsp; For example, to match any character other than newline, the following rule references ~'\n'.
  316. </p>
  317. <pre>SL_COMMENT: &quot;//&quot; (~'\n')* '\n';</pre>
  318. <p>
  319. The <font face="Courier New">~</font> operator also inverts a character set:
  320. </p>
  321. <pre>NOT_WS: ~(' ' | '\t' | '\n' | '\r');</pre>
  322. <p>
  323. The range operator can be used to create sequential character sets:
  324. </p>
  325. <pre>DIGIT : '0'..'9' ;</pre>
  326. <h3><a id="Token_Attributes" name="Token_Attributes"></a><a name="_bb11">Token Attributes</a></h3>
  327. <p>
  328. See the next section.
  329. </p>
  330. <h3><a name="_bb12"></a><a name="lexicallookahead">Lexical lookahead and the end-of-token symbol</a></h3>
  331. <p>
  332. A unique situation occurs when analyzing lexical grammars, one which is similar to the end-of-file condition when analyzing regular grammars.&nbsp; Consider how you would compute the lookahead sets for the ('b' | ) subrule in the following rule B:
  333. </p>
  334. <pre>class L extends Lexer;
  335. A : B 'b'
  336. ;
  337. protected // only called from another lex rule
  338. B : 'x' ('b' | )
  339. ;</pre>
  340. <p>
  341. The lookahead for the first alternative of the subrule is clearly 'b'.&nbsp; The second alternative is empty and the lookahead set is the set of all characters that can follow references to the subrule, which is the follow set for rule B.&nbsp; In this case, the 'b' character follows the reference to B and is therefore, indirectly, the lookahead set for the empty alternative.&nbsp; Because 'b' begins both alternatives, the parsing decision for the subrule is nondeterministic, or ambiguous as we sometimes say.&nbsp; ANTLR will justly generate a warning for this subrule (unless you suppress it with the <font face="Courier New">warnWhenFollowAmbig</font> option).
  342. </p>
  343. <p>
  344. Now, consider what would make sense for the lookahead if rule A did not exist and rule B was not protected (it was a complete token rather than a &quot;subtoken&quot;):
  345. </p>
  346. <pre>B : 'x' ('b' | )
  347. ;</pre>
  348. <p>
  349. In this case, the empty alternative finds only the end of the rule as lookahead, with no other rules referencing it.&nbsp; In the worst case, <strong>any</strong> character could follow this rule (i.e., start the next token or error sequence).&nbsp; So, should not the lookahead for the empty alternative be the entire character vocabulary?&nbsp; And should not this result in a nondeterminism warning, as it must conflict with the 'b' alternative?&nbsp; Conceptually, yes to both questions.&nbsp; From a practical standpoint, however, you are clearly saying &quot;hey, match a 'b' on the end of token B if you find one.&quot;&nbsp; I argue that no warning should be generated and that ANTLR's policy of matching elements as soon as possible makes sense here as well.
  350. </p>
  351. <p>
  352. Another reason not to represent the lookahead as the entire vocabulary is that a vocabulary of '\u0000'..'\uFFFF' is really big (one set is 2^16 / 32 long words of memory!).&nbsp; Any alternative with '&lt;end-of-token&gt;' in its lookahead set will be pushed to the ELSE or DEFAULT clause by the code generator so that huge bitsets can be avoided.
  353. </p>
  354. <p>
  355. The summary is that lookahead purely derived from hitting the end of a lexical rule (unreferenced by other rules) cannot be the cause of a nondeterminism.&nbsp; The following table summarizes a bunch of cases that will help you figure out when ANTLR will complain and when it will not.
  356. </p>
  357. <table border="1" width="100%">
  358. <tr>
  359. <td valign="top">
  360. <pre>X : 'q' ('a')? ('a')?
  361. ;</pre>
  362. </td>
  363. <td width="100">
  364. The first subrule is nondeterministic, as both 'a' from the second subrule and end-of-token are in the lookahead for the exit branch of (...)?
  365. </td>
  366. </tr>
  367. <tr>
  368. <td valign="top">
  369. <pre>X : 'q' ('a')? ('c')?
  370. ;</pre>
  371. </td>
  372. <td width="100">
  373. No nondeterminism.
  374. </td>
  375. </tr>
  376. <tr>
  377. <td valign="top">
  378. <pre>Y : 'y' X 'b'
  379. ;
  380. protected
  381. X : 'b'
  382. |
  383. ;</pre>
  384. </td>
  385. <td width="100">
  386. Nondeterminism in rule X.
  387. </td>
  388. </tr>
  389. <tr>
  390. <td valign="top">
  391. <pre>X : 'x' ('a'|'c'|'d')+
  392. | 'z' ('a')+
  393. ;</pre>
  394. </td>
  395. <td width="100">
  396. No nondeterminism, as the exit branches of the loops see lookahead computed purely from end-of-token.
  397. </td>
  398. </tr>
  399. <tr>
  400. <td valign="top">
  401. <pre>Y : 'y' ('a')+ ('a')?
  402. ;</pre>
  403. </td>
  404. <td width="100">
  405. Nondeterminism between 'a' of (...)+ and exit branch as the exit can see the 'a' of the optional subrule.&nbsp; This would be a problem even if ('a')? were simply 'a'.&nbsp; A (...)* loop would report the same problem.
  406. </td>
  407. </tr>
  408. <tr>
  409. <td valign="top">
  410. <pre>X : 'y' ('a' 'b')+ 'a' 'c'
  411. ;</pre>
  412. </td>
  413. <td width="100">
  414. At k=1, this is a nondeterminism for the (...)+ since 'a' predicts both staying in and exiting the loop.&nbsp; At k=2, no nondeterminism.
  415. </td>
  416. </tr>
  417. <tr>
  418. <td valign="top">
  419. <pre>Q : 'q' ('a' | )?
  420. ;</pre>
  421. </td>
  422. <td width="100">
  423. Here, there is an empty alternative inside an optional subrule.&nbsp; A nondeterminism is reported as two paths predict end-of-token.
  424. </td>
  425. </tr>
  426. </table>
  427. <p>
  428. You might be wondering why the first subrule below is ambiguous:
  429. </p>
  430. <pre>('a')? ('a')?</pre>
  431. <p>
  432. The answer is that the NFA to DFA conversion would result in a DFA with the 'a' transitions merged into a single state transition!&nbsp; This is ok for a DFA where you cannot have actions anywhere except after a complete match.&nbsp; Remember that ANTLR lets you do the following:
  433. </p>
  434. <pre>('a' {do-this})? ('a' {do-that})?</pre>
  435. <p>
  436. One other thing is important to know.&nbsp; Recall that alternatives in lexical rules are reordered according to their lookahead requirements, from highest to lowest.
  437. </p>
  438. <pre>A : 'a'
  439. | 'a' 'b'
  440. ;</pre>
  441. <p>
  442. At k=2, ANTLR can see 'a' followed by '&lt;end-of-token&gt;' for the first alternative and 'a' followed by 'b' for the second.&nbsp; Because the lookahead at depth 2 for the first alternative is '&lt;end-of-token&gt;', ANTLR suppresses the warning that depth two could match any character for that alternative.&nbsp; To behave naturally and to generate good code when no warning is generated, ANTLR reorders the alternatives so that the generated code is similar to:
  443. </p>
  444. <pre>A() {
  445. if ( LA(1)=='a' &amp;&amp; LA(2)=='b' ) { // alt 2
  446. match('a'); match('b');
  447. }
  448. else if ( LA(1)=='a' ) { // alt 1
  449. match('a');
  450. }
  451. else {<em>error</em>;}
  452. }</pre>
  453. <p>
  454. Note the lack of a lookahead test at depth 2 for alternative 1.&nbsp; When an empty alternative is present, ANTLR moves it to the end.&nbsp; For example,
  455. </p>
  456. <pre>A : 'a'
  457. |
  458. | 'a' 'b'
  459. ;</pre>
  460. <p>
  461. results in code like this:
  462. </p>
  463. <pre>A() {
  464. if ( LA(1)=='a' &amp;&amp; LA(2)=='b' ) { // alt 2
  465. match('a'); match('b');
  466. }
  467. else if ( LA(1)=='a' ) { // alt 1
  468. match('a');
  469. }
  470. else {
  471. }
  472. }</pre>
  473. <p>
  474. Note that there is no way for a lexing error to occur here (which makes sense because the rule is optional--though this rule only makes sense when <font face="Courier New">protected</font>).
  475. </p>
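<p>
The reordering described above boils down to testing the longest alternative first. A hypothetical, self-contained Java sketch (the <tt>ReorderedAlts</tt> class is invented for illustration; it assumes k=2 lookahead) makes the shape of the decision concrete:
</p>

```java
// Hypothetical sketch of the reordered decision for A : 'a' | 'a' 'b' ;
public class ReorderedAlts {
    private static final char EOF = '\uFFFF';

    private static char la(String in, int k) {   // LA(k), 1-based lookahead
        return k - 1 < in.length() ? in.charAt(k - 1) : EOF;
    }

    // Alternative 2 ('a' 'b') is tested before alternative 1 ('a'), so the
    // one-character alternative only wins when depth 2 is not 'b'.
    public static String decide(String in) {
        if (la(in, 1) == 'a' && la(in, 2) == 'b') return "alt2"; // 'a' 'b'
        if (la(in, 1) == 'a') return "alt1";                     // 'a'
        return "error";
    }
}
```

<p>
Note that input <tt>"ac"</tt> picks alternative 1: depth 2 for that alternative is end-of-token, so any following character is acceptable.
</p>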
  476. <p>
  477. Semantic predicates get moved along with their associated alternatives when the alternatives are sorted by lookahead depth.&nbsp; It would be weird if the addition of a {true}? predicate (which implicitly exists for each alternative) changed what the lexer recognized!&nbsp; The following rule is reordered so that alternative 2 is tested first.
  478. </p>
  479. <pre>B : {true}? 'a'
  480. | 'a' 'b'
  481. ;</pre>
  482. <p>
  483. Syntactic predicates are <strong>not</strong> reordered.&nbsp; Mentioning the predicate after the rule it conflicts with results in an ambiguity, as in this rule:
  484. </p>
  485. <pre>F : 'c'
  486. | ('c')=&gt; 'c'
  487. ;</pre>
  488. <p>
  489. Other alternatives are, however, reordered with respect to the syntactic predicates, even when a switch is generated for the LL(1) components and the syntactic predicates are pushed into the default case.&nbsp; The following rule illustrates the point.
  490. </p>
  491. <pre>F : 'b'
  492. | {/* empty-path */}
  493. | ('c')=&gt; 'c'
  494. | 'c'
  495. | 'd'
  496. | 'e'
  497. ;</pre>
  498. <p>
  499. Rule F's decision is generated as follows:
  500. </p>
  501. <pre> switch ( la_1) {
  502. case 'b':
  503. {
  504. match('b');
  505. break;
  506. }
  507. case 'd':
  508. {
  509. match('d');
  510. break;
  511. }
  512. case 'e':
  513. {
  514. match('e');
  515. break;
  516. }
  517. default:
  518. boolean synPredMatched15 = false;
  519. if (((la_1=='c'))) {
  520. int _m15 = mark();
  521. synPredMatched15 = true;
  522. guessing++;
  523. try {
  524. match('c');
  525. }
  526. catch (RecognitionException pe) {
  527. synPredMatched15 = false;
  528. }
  529. rewind(_m15);
  530. guessing--;
  531. }
  532. if ( synPredMatched15 ) {
  533. match('c');
  534. }
  535. else if ((la_1=='c')) {
  536. match('c');
  537. }
  538. else {
  539. if ( guessing==0 ) {
  540. /* empty-path */
  541. }
  542. }
  543. }</pre>
  544. <p>
  545. Notice how the empty path got moved after the test for the 'c' alternative.
  546. </p>
  547. <h3><a id="Scanning_Binary_Files" name="Scanning_Binary_Files"></a><a name="Scanning Binary Files">Scanning Binary Files</a></h3>
  548. <p>
  549. Character literals are not limited to printable ASCII characters.&nbsp; To demonstrate the concept, imagine that you want to parse a binary file that contains strings and short integers.&nbsp; To distinguish between them, marker bytes are used according to the following format:
  550. </p>
  551. <table border="1" width="100%">
  552. <tr>
  553. <th width="50%">
  554. format
  555. </th>
  556. <th width="50%" align="center">
  557. description
  558. </th>
  559. </tr>
  560. <tr>
  561. <td width="50%">
  562. '\0' <em>highbyte lowbyte</em>
  563. </td>
  564. <td width="50%" align="center">
  565. Short integer
  566. </td>
  567. </tr>
  568. <tr>
  569. <td width="50%">
  570. '\1' <em>string of non-'\2' chars</em> '\2'
  571. </td>
  572. <td width="50%" align="center">
  573. String
  574. </td>
  575. </tr>
  576. </table>
  577. <p>
  578. Sample input (274 followed by &quot;a test&quot;) might look like the following in hex (output from UNIX <strong>od</strong> <strong>-h</strong> command):
  579. </p>
  580. <pre>0000000000 00 01 12 01 61 20 74 65 73 74 02 </pre>
  581. <p>
  582. or as viewed as characters:
  583. </p>
  584. <pre>0000000000 \0 001 022 001 a t e s t 002</pre>
  585. <p>
  586. The parser is trivially just a (...)+ around the two types of input tokens:
  587. </p>
  588. <pre>class DataParser extends Parser;
  589. file: ( sh:SHORT
  590. {System.out.println(sh.getText());}
  591. | st:STRING
  592. {System.out.println(&quot;\&quot;&quot;+
  593. st.getText()+&quot;\&quot;&quot;);}
  594. )+
  595. ;</pre>
  596. <p>
  597. All of the interesting stuff happens in the lexer.&nbsp; First, define the class and set the vocabulary to be all 8 bit binary values:
  598. </p>
  599. <pre>class DataLexer extends Lexer;
  600. options {
  601. charVocabulary = '\u0000'..'\u00FF';
  602. }</pre>
  603. <p>
  604. Then, define the two tokens according to the specifications, with markers around the string and a single marker byte in front of the short:
  605. </p>
  606. <pre>SHORT
  607. : // match the marker followed by any 2 bytes
  608. '\0' high:. lo:.
  609. {
  610. // pack the bytes into a two-byte short
  611. int v = (((int)high)&lt;&lt;8) + lo;
  612. // make a string out of the value
  613. $setText(&quot;&quot;+v);
  614. }
  615. ;
  616. STRING
  617. : '\1'! // begin string (discard)
  618. ( ~'\2' )*
  619. '\2'! // end string (discard)
  620. ;</pre>
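<p>
The packing action in rule <tt>SHORT</tt> is plain bit arithmetic: the first byte supplies the high 8 bits and the second the low 8 bits. As a self-contained sketch (the <tt>ShortPacker</tt> class name is invented for illustration):
</p>

```java
// Hypothetical sketch of the byte-packing action in rule SHORT.
public class ShortPacker {
    // Combine a high byte and a low byte into one 16-bit value,
    // using the same arithmetic as the lexer action above.
    public static int pack(char high, char low) {
        return (((int) high) << 8) + low;
    }
}
```

<p>
For the sample input, the bytes <tt>01 12</tt> after the <tt>'\0'</tt> marker pack to (1&lt;&lt;8)+18 = 274.
</p>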
  621. <p>
  622. To invoke the parser, use something like the following:
  623. </p>
  624. <pre>import java.io.*;
  625. class Main {
  626. public static void main(String[] args) {
  627. try {
  628. // use DataInputStream to grab bytes
  629. DataLexer lexer =
  630. new DataLexer(
  631. new DataInputStream(System.in)
  632. );
  633. DataParser parser =
  634. new DataParser(lexer);
  635. parser.file();
  636. } catch(Exception e) {
  637. System.err.println(&quot;exception: &quot;+e);
  638. }
  639. }
  640. }</pre>
  641. <h3><a name="unicode"></a>Scanning Unicode Characters</h3>
  642. <p>
  643. ANTLR (as of 2.7.1) allows you to recognize input composed of Unicode characters; that is, you are not restricted to 8 bit ASCII characters.&nbsp; I would like to emphasize that ANTLR <em>allows</em>, but does not yet <em>support</em>, Unicode, as there is more work to be done.&nbsp; For example, end-of-file is currently incorrectly specified:
  644. </p>
  645. <pre>CharScanner.EOF_CHAR=(char)-1;</pre>
  646. <p>
  647. This must be an integer -1, not a char; the cast actually
  648. narrows -1 to 0xFFFF. &nbsp; I have to go through
  649. the entire code base looking for these problems.&nbsp; Plus,
  650. we should really have a special syntax to mean &quot;java
  651. identifier character&quot; and some standard encodings for
  652. non-Western character sets etc... I expect 2.7.3 to add nice
  653. predefined character blocks like <tt>LETTER</tt>.
  654. </p>
<p>
The following is a very simple example of how to match a series of space-separated identifiers.
</p>
<pre>class L extends Lexer;
options {
  // Allow any char but \uFFFF (16 bit -1)
  charVocabulary='\u0000'..'\uFFFE';
}
{
  private static boolean done = false;
  public void uponEOF()
    throws TokenStreamException, CharStreamException
  {
    done=true;
  }
  public static void main(String[] args) throws Exception {
    L lexer = new L(System.in);
    while ( !done ) {
      Token t = lexer.nextToken();
      System.out.println(&quot;Token: &quot;+t);
    }
  }
}
ID : ID_START_LETTER ( ID_LETTER )*
   ;
WS : (' '|'\n') {$setType(Token.SKIP);}
   ;
protected
ID_START_LETTER
   : '$'
   | '_'
   | 'a'..'z'
   | '\u0080'..'\ufffe'
   ;
protected
ID_LETTER
   : ID_START_LETTER
   | '0'..'9'
   ;</pre>
<p>
A final note on Unicode.&nbsp; The ~<em>x</em> &quot;not&quot; operator includes everything in your specified vocabulary (up to the 16-bit character space) except <em>x</em>.&nbsp; For example,
</p>
<pre>~('$'|'a'..'z')</pre>
<p>
results in every Unicode character except '$' and the lowercase Latin-1 letters, assuming your charVocabulary is 0..FFFF.
</p>
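<p>
As an illustration (a plain-Java sketch, not ANTLR output--ANTLR actually generates bitset tests), the inverted set amounts to a membership test over the whole vocabulary:
</p>

```java
// Sketch of the set matched by ~('$'|'a'..'z') over a '\u0000'..'\uFFFF'
// vocabulary: every character except the ones listed.
public class NotSetDemo {
    static boolean inNotSet(char c) {
        return !(c == '$' || (c >= 'a' && c <= 'z'));
    }

    public static void main(String[] args) {
        System.out.println(inNotSet('A'));      // true: uppercase is not excluded
        System.out.println(inNotSet('$'));      // false
        System.out.println(inNotSet('q'));      // false
        System.out.println(inNotSet('\u00e9')); // true: outside 'a'..'z'
    }
}
```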
<h3><a id="Manipulating_Token_Text_and_Objects" name="Manipulating_Token_Text_and_Objects"></a><a name="Manipulating Token Text and Objects">Manipulating Token Text and Objects</a></h3>
<p>
Once you have specified what to match in a lexical rule, you may ask &quot;what can I discover about what will be matched for each rule element?&quot;&nbsp; ANTLR allows you to label the various elements and, at parse-time, access the text matched for the element.&nbsp; You can even specify the token object to return from the rule and, hence, from the lexer to the parser.&nbsp; This section describes the text and token object handling characteristics of ANTLR.
</p>
<h4><a id="Manipulating_the_Text_of_a_Lexical_Rule" name="Manipulating_the_Text_of_a_Lexical_Rule"></a><a name="_bb14">Manipulating the Text of a Lexical Rule</a></h4>
<p>
There are times when you want to look at the text matched for the current rule, alter it, or set the text of a rule to a new string.&nbsp; The most common case is when you want to simply discard the text associated with a few of the elements matched for a rule, such as quotes.
</p>
<p>
ANTLR provides the '!' operator that lets you indicate that certain elements should not contribute to the text for the token being recognized. The '!' operator is used just as when building trees in the parser. For example, if you are matching HTML tags and you do not want the '&lt;' and '&gt;' characters returned as part of the token text, you could manually remove them from the token's text before they are returned, but a better way is to suffix the unwanted characters with '!'. For example, the &lt;br&gt; tag might be recognized as follows:
</p>
<pre>BR : '&lt;'! &quot;br&quot; '&gt;'! ; // discard &lt; and &gt;</pre>
<p>
Suffixing a lexical rule reference with '!' forces the text matched by the invoked rule to be discarded (it will not appear in the text for the invoking rule).&nbsp; For example, if you do not care about the fractional part of a floating point number, you can suffix the rule that matches it with a '!':
</p>
<pre>FLOAT : INT ('.'! INT!)? ; // keep only first INT</pre>
<p>
As a shorthand notation, you may suffix an alternative or rule with '!' to indicate the alternative or rule should not pass any text back to the invoking rule or parser (if nonprotected):
</p>
<pre>// ! on rule: nothing is auto added to text of rule.
rule! : ... ;
// ! on alt: nothing is auto added to text for alt
rule : ... |! ...;</pre>
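<p>
In plain-Java terms (an illustrative sketch, not generated code), a '!'-suffixed element is consumed from the input but skipped when accumulating the token text:
</p>

```java
// Hand-written analogue of BR : '<'! "br" '>'! -- all three elements
// are matched, but only the unsuffixed "br" is appended to the text.
public class BangDemo {
    static String lexBR(String input) {
        StringBuffer text = new StringBuffer();
        int i = 0;
        if (input.charAt(i) != '<')
            throw new IllegalArgumentException("expected '<'");
        i++;                                   // '<' matched; '!' => not appended
        if (!input.startsWith("br", i))
            throw new IllegalArgumentException("expected \"br\"");
        text.append("br");                     // "br" matched and appended
        i += 2;
        if (input.charAt(i) != '>')
            throw new IllegalArgumentException("expected '>'");
        // '>' matched; '!' => not appended
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(lexBR("<br>"));     // prints: br
    }
}
```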
<table border="1">
<tr>
<th width="175">Item suffixed with '!'</th>
<th>Effect</th>
</tr>
<tr>
<td width="175" align="center">char or string literal</td>
<td align="left">Do not add the text for this atom to the current rule's text.</td>
</tr>
<tr>
<td width="175" align="center">rule reference</td>
<td align="left">Do not add the text matched while recognizing this rule to the current rule's text.</td>
</tr>
<tr>
<td width="175" align="center">alternative</td>
<td align="left">Nothing matched by the alternative is added to the current rule's text; the enclosing rule contributes nothing to any invoking rule's text.&nbsp; For nonprotected rules, the text for the token returned to the parser is blank.</td>
</tr>
<tr>
<td width="175" align="center">rule definition</td>
<td align="left">Nothing matched by <strong>any</strong> alternative is added to the current rule's text; the rule contributes nothing to any invoking rule's text.&nbsp; For nonprotected rules, the text for the token returned to the parser is blank.</td>
</tr>
</table>
<p>
While the '!' implies that the text is not added to the text for the current rule, you can label an element to access the text (via the token if the element is a rule reference).
</p>
<p>
In terms of implementation, the characters are always added to the current text buffer, but are carved out when necessary (as this will be the exception rather than the rule, making the normal case efficient).
</p>
<p>
The '!' operator is great for discarding certain characters or groups of characters, but what about the case where you want to insert characters or totally reset the text for a rule or token?&nbsp; ANTLR provides a series of special methods to do this (we prefix the methods with '$' because Java does not have a macro facility and ANTLR must recognize the special methods in your actions).&nbsp; The following table summarizes.
</p>
<table border="1">
<tr>
<th width="175">Method</th>
<th>Description/Translation</th>
</tr>
<tr>
<td align="center" width="175"><font face="Courier New">$append(x)</font></td>
<td>Append x to the text of the surrounding rule.&nbsp; Translation: <font face="Courier New">text.append(x)</font></td>
</tr>
<tr>
<td align="center" width="175"><font face="Courier New">$setText(x)</font></td>
<td>Set the text of the surrounding rule to x.&nbsp; Translation: <font face="Courier New">text.setLength(_begin); text.append(x)</font></td>
</tr>
<tr>
<td align="center" width="175"><font face="Courier New">$getText</font></td>
<td>Return a String of the text for the surrounding rule.&nbsp; Translation:
<br>
<font face="Courier New">new String(text.getBuffer(),<br>_begin,text.length()-_begin)</font></td>
</tr>
<tr>
<td align="center" width="175"><font face="Courier New">$setToken(x)</font></td>
<td>Set the token object that this rule is to return.&nbsp; See the section on <a href="#Token Object Creation">Token Object Creation</a>. Translation: <font face="Courier New">_token = x</font></td>
</tr>
<tr>
<td align="center" width="175"><font face="Courier New">$setType(x)</font></td>
<td>Set the token type of the surrounding rule.&nbsp; Translation: <font face="Courier New">_ttype = x</font></td>
</tr>
<tr>
<td align="center" width="175"><font face="Courier New">setText(x)</font></td>
<td>Set the text for the entire token being recognized, regardless of what rule the action is in. No translation.</td>
</tr>
<tr>
<td align="center" width="175"><font face="Courier New">getText()</font></td>
<td>Get the text for the entire token being recognized, regardless of what rule the action is in. No translation.</td>
</tr>
</table>
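<p>
The translations in the table can be mimicked directly (a sketch using a plain StringBuffer in place of ANTLR's internal buffer class; the names <font face="Courier New">text</font> and <font face="Courier New">_begin</font> are taken from the table above):
</p>

```java
// Mimics the buffer manipulation behind $append, $setText, and $getText:
// 'text' accumulates the whole token, '_begin' marks where the current
// rule's text starts.
public class TextOps {
    StringBuffer text = new StringBuffer();
    int _begin = 0;

    void append(String x) {            // $append(x)
        text.append(x);
    }

    void setText(String x) {           // $setText(x)
        text.setLength(_begin);
        text.append(x);
    }

    String getText() {                 // $getText
        return text.substring(_begin);
    }
}
```

For example, after <font face="Courier New">append("\\")</font> and <font face="Courier New">append("n")</font>, a call to <font face="Courier New">setText("\n")</font> discards both characters and leaves a single newline as the rule's text--exactly the trick the ESCAPE rule below plays.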
<p>
One of the great things about an ANTLR-generated lexer is that the text of a token can be modified incrementally as the token is recognized (an impossible task for a DFA-based lexer):
</p>
<pre>STRING: '&quot;' ( ESCAPE | ~('&quot;'|'\\') )* '&quot;' ;
protected
ESCAPE
  : '\\'
    ( 'n' { $setText(&quot;\n&quot;); }
    | 'r' { $setText(&quot;\r&quot;); }
    | 't' { $setText(&quot;\t&quot;); }
    | '&quot;' { $setText(&quot;\&quot;&quot;); }
    )
  ;</pre>
<h4><a name="_bb15"></a><a name="Token Object Creation">Token Object Creation</a></h4>
<p>
Because lexical rules can call other rules just as in the parser, you sometimes want to know what text was matched for that portion of the token being matched. To support this, ANTLR allows you to label lexical rules and obtain a <font face="Courier New">Token</font> object representing the text, token type, line number, etc. matched for that rule reference.&nbsp; This ability corresponds to being able to access the text matched for a lexical state in a DFA-based lexer.&nbsp; For example, here is a simple rule that prints out the text matched for a rule reference, INT.
</p>
<pre>INDEX : '[' i:INT ']'
        {System.out.println(i.getText());}
      ;</pre> <pre>INT : ('0'..'9')+ ;</pre>
<p>
If you moved the labeled reference and action to a parser, it would do the same thing (match an integer and print it out).
</p>
<p>
All lexical rules <em>conceptually</em> return a <font face="Courier New">Token</font> object, but in practice this would be inefficient. ANTLR generates methods so that a token object is created only if an invoking reference is labeled (indicating the caller wants the token object).&nbsp; Imagine another rule that calls INT without a label.
</p>
<pre>FLOAT : INT ('.' INT)? ;</pre>
<p>
In this case, no token object is created for either reference to INT.&nbsp; You will notice a boolean argument to every lexical rule that tells it whether or not a token object should be created and returned (via a member variable).&nbsp; All nonprotected rules (those that are &quot;exposed&quot; to the parser) must always generate tokens, which are passed back to the parser.
</p>
<h4><a id="Heterogeneous_Token_Object_Streams" name="Heterogeneous_Token_Object_Streams"></a><a name="_bb16">Heterogeneous Token Object Streams</a></h4>
<p>
While token creation is normally handled automatically, you can also manually specify the token object to be returned from a lexical rule. The advantage is that you can pass heterogeneous token objects back to the parser, which is extremely useful for parsing languages with complicated tokens such as HTML (the <font face="Courier New">&lt;img&gt;</font> and <font face="Courier New">&lt;table&gt;</font> tokens, for example, can have lots of attributes).&nbsp; Here is a rule for the &lt;img&gt; tag that returns a token object of type ImageToken:
</p>
<pre>IMAGE
{
  Attributes attrs;
}
  : &quot;&lt;img &quot; attrs=ATTRIBUTES '&gt;'
    {
      ImageToken t = new ImageToken(IMAGE,$getText);
      t.setAttributes(attrs);
      $setToken(t);
    }
  ;
ATTRIBUTES returns [Attributes a]
  : ...
  ;</pre>
<p>
The <font face="Courier New">$setToken</font> function specifies that its argument is to be returned when the rule exits.&nbsp; The parser will receive this specific object instead of a <font face="Courier New">CommonToken</font> or whatever else you may have specified with the <font face="Courier New">Lexer.setTokenObjectClass</font> method.&nbsp; The action in rule <font face="Courier New">IMAGE</font> references a token type, <font face="Courier New">IMAGE</font>, and a lexical rule reference, <font face="Courier New">ATTRIBUTES</font>, which matches all of the attributes of an image tag and returns them in a data structure called <font face="Courier New">Attributes</font>.
</p>
<p>
What would it mean for rule <font face="Courier New">IMAGE</font> to be protected (i.e., referenced only from other lexical rules rather than from <font face="Courier New">nextToken</font>)?&nbsp; Any invoking labeled rule reference would receive the object (not the parser) and could examine it, manipulate it, or pass it on to the invoker of that rule.&nbsp; For example, if <font face="Courier New">IMAGE</font> were called from <font face="Courier New">TAGS</font> rather than being nonprotected, rule <font face="Courier New">TAGS</font> would have to pass the token object back to the parser for it.
</p>
<pre>TAGS : img:IMAGE
       {$setToken(img);} // pass to parser
     | PARAGRAPH // probably has no special token
     | ...
     ;</pre>
<p>
Setting the token object for a nonprotected rule invoked without a label has no effect other than to waste time creating an object that will not be used.
</p>
<p>
We use a <tt>CharScanner</tt> member <tt>_returnToken</tt> to do the return in order not to conflict with return values used by the grammar developer. For example,
</p>
<pre>PTAG: &quot;&lt;p&gt;&quot; {$setToken(new ParagraphToken($$));} ; </pre>
<p>
which would be translated to something like:
</p>
<pre>protected final void mPTAG()
  throws RecognitionException, CharStreamException,
         TokenStreamException {
  Token _token = null;
  match(&quot;&lt;p&gt;&quot;);
  _returnToken =
    new ParagraphToken(<em>text-of-current-rule</em>);
}</pre> <h3><a name="_bb17"></a><a name="Filtering Input Streams">Filtering Input Streams</a></h3>
<p>
You often want to perform an action upon seeing a pattern or two in a complicated input stream, such as pulling out links in an HTML file.&nbsp; One solution is to take the HTML grammar and just put actions where you want.&nbsp; Using a complete grammar is overkill, and you may not have a complete grammar to start with.
</p>
<p>
ANTLR provides a mechanism similar to AWK that lets you say &quot;here are the patterns I'm interested in--ignore everything else.&quot;&nbsp; Naturally, AWK is limited to regular expressions whereas ANTLR accepts context-free grammars (Uber-AWK?).&nbsp; For example, consider pulling out the &lt;p&gt; and &lt;br&gt; tags from an arbitrary HTML file.&nbsp; Using the filter option, this is easy:
</p>
<pre>class T extends Lexer;
options {
  k=2;
  filter=true;
}
P : &quot;&lt;p&gt;&quot; ;
BR: &quot;&lt;br&gt;&quot; ;</pre>
<p>
In this &quot;mode&quot;, there is no possibility of a syntax error.&nbsp; Either the pattern is matched exactly or it is filtered out.
</p>
<p>
This works very well for many cases, but is not sophisticated enough to handle the situation where you want &quot;almost matches&quot; to be reported as errors.&nbsp; Consider the addition of the &lt;table...&gt; tag to the previous grammar:
</p>
<pre>class T extends Lexer;
options {
  k=2;
  filter = true;
}
P : &quot;&lt;p&gt;&quot; ;
BR: &quot;&lt;br&gt;&quot; ;
TABLE : &quot;&lt;table&quot; (WS)? (ATTRIBUTE)* (WS)? '&gt;' ;
WS : ' ' | '\t' | '\n' ;
ATTRIBUTE : ... ;</pre>
<p>
Now, consider input &quot;&lt;table 8 = width ;&gt;&quot; (a bogus table definition). As is, the lexer would simply scarf past this input without &quot;noticing&quot; the invalid table. What if you want to indicate that a bad table definition was found as opposed to ignoring it?&nbsp; Call method
</p>
<pre>setCommitToPath(boolean commit)</pre>
<p>
in your TABLE rule to indicate that you want the lexer to commit to recognizing the table tag:
</p>
<pre>TABLE
  : &quot;&lt;table&quot; (WS)?
    {setCommitToPath(true);}
    (ATTRIBUTE)* (WS)? '&gt;'
  ;</pre>
<p>
Input &quot;&lt;table 8 = width ;&gt;&quot; would result in a syntax error.&nbsp; Note the placement after the whitespace recognition; you do not want &lt;tabletop&gt; reported as a bad table (you want to ignore it).
</p>
<p>
One further complication in filtering: what if the &quot;skip language&quot; (the stuff in between valid tokens or tokens of interest) cannot be correctly handled by simply consuming a character and trying again for a valid token?&nbsp; You may want to ignore comments or strings or whatever.&nbsp; In that case, you can specify a rule that scarfs anything between tokens of interest by using option <font face="Courier New">filter=<em>RULE</em></font>.&nbsp; For example, the grammar below filters for &lt;p&gt; and &lt;br&gt; tags as before, but also prints out any other tag (&lt;...&gt;) encountered.
</p>
<pre>class T extends Lexer;
options {
  k=2;
  filter=IGNORE;
  charVocabulary = '\3'..'\177';
}
P : &quot;&lt;p&gt;&quot; ;
BR: &quot;&lt;br&gt;&quot; ;
protected
IGNORE
  : '&lt;' (~'&gt;')* '&gt;'
    {System.out.println(&quot;bad tag:&quot;+$getText);}
  | ( &quot;\r\n&quot; | '\r' | '\n' ) {newline();}
  | .
  ;</pre>
<p>
Notice that the filter rule must track newlines in the general case where the lexer might emit error messages, so that the line number is not stuck at 0.
</p>
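<p>
The alternative <font face="Courier New">( "\r\n" | '\r' | '\n' )</font> in the IGNORE rule above counts each line-terminator style exactly once; a plain-Java sketch of that bookkeeping:
</p>

```java
// Counts line terminators the way the filter rule's newline alternative
// does: "\r\n", a lone '\r', and a lone '\n' each count as one line.
public class LineCounter {
    static int countNewlines(String s) {
        int lines = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\r') {
                lines++;
                if (i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                    i++; // "\r\n" is a single newline, not two
                }
            } else if (c == '\n') {
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(countNewlines("a\r\nb\rc\nd")); // prints: 3
    }
}
```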
<p>
The filter rule is invoked either when the lookahead (in nextToken) predicts none of the nonprotected lexical rules or when one of those rules fails.&nbsp; In the latter case, the input is rolled back before attempting the filter rule.&nbsp; Option <font face="Courier New">filter=true</font> is like having a filter rule such as:
</p>
<pre>IGNORE : . ;</pre>
<p>
Actions in regular lexical rules are executed even if the rule fails and the filter rule is called.&nbsp; To do otherwise would require every valid token to be matched twice (once to match and once to do the actions, as with a syntactic predicate)! Plus, there are few actions in lexer rules (usually they are at the end, at which point an error cannot occur).
</p>
<p>
Is the filter rule called when commit-to-path is true and an error is found in a lexer rule? No, an error is reported as with filter=true.
</p>
<p>
What happens if there is a syntax error in the filter rule?&nbsp; Well, you can either put an exception handler on the filter rule or accept the default behavior, which is to consume a character and begin looking for another valid token.
</p>
In summary, the filter option allows you to:
<ol>
<li>
Filter like awk (only perfect matches reported--no such thing as a syntax error)
</li>
<li>
Filter like awk + catch poorly-formed matches (that is, &quot;almost matches&quot; like &lt;table 8=3;&gt; result in an error)
</li>
<li>
Filter but specify the skip language
</li>
</ol>
<h4><a id="ANTLR_Masquerading_as_SED" name="ANTLR_Masquerading_as_SED"></a><a name="ANTLR Masquerading as SED">ANTLR Masquerading as SED</a></h4>
<p>
To make ANTLR generate lexers that behave like the UNIX utility sed (copy standard input to standard output except as specified by the replace patterns), use a filter rule that does the input-to-output copying:
</p>
<pre>class T extends Lexer;
options {
  k=2;
  filter=IGNORE;
  charVocabulary = '\3'..'\177';
}

P : &quot;&lt;p&gt;&quot; {System.out.print(&quot;&lt;P&gt;&quot;);};
BR : &quot;&lt;br&gt;&quot; {System.out.print(&quot;&lt;BR&gt;&quot;);};

protected
IGNORE
  : ( &quot;\r\n&quot; | '\r' | '\n' )
    {newline(); System.out.println(&quot;&quot;);}
  | c:. {System.out.print(c);}
  ;</pre>
<p>
This example dumps anything other than &lt;p&gt; and &lt;br&gt; tags to standard out and pushes lowercase &lt;p&gt; and &lt;br&gt; to uppercase. Works great.
</p>
<h4><a id="Nongreedy_Subrules" name="Nongreedy_Subrules"></a><a name="Nongreedy Subrules">Nongreedy Subrules</a></h4>
<p>
Quick:&nbsp; what does the following match?
</p>
<pre>BLOCK : '{' (.)* '}';</pre>
<p>
Your first reaction is that it matches any set of characters inside curly braces.&nbsp; In reality, it matches '{' followed by every single character left on the input stream!&nbsp; Why?&nbsp; Well, because ANTLR loops are <em>greedy</em>--they consume as much input as they can match.&nbsp; Since the wildcard matches any character, the loop consumes the '}' and beyond.&nbsp; This is a pain for matching strings, comments, and so on.
</p>
<p>
Why can't we switch it around so that the loop consumes only until it sees something on the input stream that matches what <strong>follows</strong> the loop, such as the '}'?&nbsp; That is, why can't we make loops <em>nongreedy</em>?&nbsp; The answer is that we can, but sometimes you want greedy and sometimes you want nongreedy (PERL has both kinds of closure loops now too).&nbsp; Unfortunately, parsers usually want greedy and lexers usually want nongreedy loops.&nbsp; Rather than make the same syntax behave differently in the various situations, Terence decided to leave the semantics of loops as they are (greedy) and add a subrule option to make loops nongreedy.
</p>
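<p>
Java's regular expressions offer the same distinction (an analogy only--this is not ANTLR syntax): a greedy <font face="Courier New">.*</font> runs past the first '}', while the reluctant <font face="Courier New">.*?</font> stops at it:
</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Greedy vs. nongreedy closure, demonstrated on the input "{a}{b}".
public class GreedyDemo {
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: consumes as much as possible, past the first '}'.
        System.out.println(firstMatch("\\{.*\\}", "{a}{b}"));  // prints: {a}{b}
        // Reluctant (nongreedy): stops at the first '}'.
        System.out.println(firstMatch("\\{.*?\\}", "{a}{b}")); // prints: {a}
    }
}
```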
<h4><a id="Greedy_Subrules" name="Greedy_Subrules"></a><a name="Greedy Parser Subrules">Greedy Subrules</a></h4>
<p>
I have yet to see a case when building a parser grammar where I did not want a subrule to match as much input as possible.&nbsp; For example, the solution to the classic if-then-else clause ambiguity is to match the &quot;else&quot; as soon as possible:
</p>
<pre>stat : &quot;if&quot; expr &quot;then&quot; stat (&quot;else&quot; stat)?
     | ...
     ;</pre>
<p>
This ambiguity (which statement should the &quot;else&quot; be attached to?) results in a parser nondeterminism.&nbsp; ANTLR warns you about the <font face="Courier New">(...)?</font> subrule as follows:
</p>
<pre>warning: line 3: nondeterminism upon
        k==1:&quot;else&quot;
        between alts 1 and 2 of block</pre>
<p>
If, on the other hand, you make it clea
</p>
