boo-lang /lib/antlr-2.7.5/doc/lexer.html

Repository https://github.com/boo/boo-lang.git
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
	<title>Lexical Analysis with ANTLR</title> 
</head>
<body bgcolor="#FFFFFF">
<h2><a id="Lexical_Analysis_with_ANTLR" name="Lexical_Analysis_with_ANTLR">Lexical Analysis with ANTLR</a></h2> 
<p>
	A <em>lexer</em> (often called a scanner) breaks up an input stream of characters into vocabulary symbols for a parser, which applies a grammatical structure to that symbol stream. Because ANTLR employs the same recognition mechanism for lexing, parsing, and tree parsing, ANTLR-generated lexers are much stronger than DFA-based lexers such as those generated by DLG (from PCCTS 1.33) and lex.
</p>
<p>
	The increase in lexing power comes at the cost of some inconvenience in lexer specification, and indeed requires a serious shift in how you think about lexical analysis. See a <a href="lexer.html#dfacompare">comparison of LL(k) and DFA-based lexical analysis</a>.
</p>
<p>
	ANTLR generates predicated-LL(k) lexers, which means that you can have semantic and syntactic predicates and use k&gt;1 lookahead. The other advantages are:
<ul>
	<li>
		You can actually read and debug the output, as it's very similar to what you would build by hand.
	</li>
	<li>
		The syntax for specifying lexical structure is the same for lexers, parsers, and tree parsers.
	</li>
	<li>
		You can have actions executed during the recognition of a single token.
	</li>
	<li>
		You can recognize complicated tokens such as HTML tags or &quot;executable&quot; comments like the javadoc <font face="Courier New">@</font>-tags inside <font size="2" face="Courier New">/** ... */</font> comments. The lexer has a stack, unlike a DFA, so you can match nested structures such as nested comments.
	</li>
</ul>
<p>
	The overall structure of a lexer is:
</p>
<pre>class MyLexer extends Lexer;
options {
  <em>some options</em>
}
{<em>
  lexer class members</em>
}
<em>lexical rules</em></pre>

<h3><a name="lexicalrules">Lexical Rules</a></h3> 
<p>
	Rules defined within a lexer grammar must have a name beginning with an uppercase letter. These rules implicitly match characters on the input stream instead of tokens on the token stream. Referenced grammar elements include token references (implicit lexer rule references), characters, and strings. Lexer rules are processed in the exact same manner as parser rules and, hence, may specify arguments and return values; further, lexer rules can also have local variables and use recursion. The following defines a rule called <font size="2" face="Courier New">ID</font> that is available as a token type in the parser.
</p>
<pre>ID : ( 'a'..'z' )+
   ;</pre> 
<p>
	This rule would become part of the resulting lexer and would appear as a method called <font size="2" face="Courier New">mID()</font> that looks sort of like this:

<tt><pre>
    public final void mID(...)
        throws RecognitionException,
               CharStreamException, TokenStreamException
    {
        ...
        _loop3:
        do {
            if (((LA(1) &gt;= 'a' &amp;&amp; LA(1) &lt;= 'z'))) {
                matchRange('a','z');
            }
        } while (...);
        ...
    }
</pre></tt>

<p>
It is a good idea to become familiar with ANTLR's output--the generated lexers are human-readable and make a lot of concepts more transparent.

<h4><a id="Skipping_characters" name="Skipping_characters">Skipping characters</a></h4> 
<p>
	To have the characters matched by a rule ignored, set the token type to <font size="2" face="Courier New">Token.SKIP</font>. For example,
</p>
<pre>WS : ( ' ' | '\t' | '\n' { newline(); } | '\r' )+
     { $setType(Token.SKIP); }
   ;</pre>

Skipped tokens force the lexer to reset and try for another
token.  Skipped tokens are never sent back to the parser.
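<p>
The effect of <tt>Token.SKIP</tt> on the <tt>nextToken</tt> loop can be sketched in plain Java. This is a simplified, hypothetical model (the class and method names below are illustrative, not ANTLR's generated code): a toy lexer matches runs of blanks as SKIP tokens and runs of non-blanks as INT tokens, and the loop throws the skipped ones away.
</p>

```java
public class SkipDemo {
    static final int EOF = 0, INT = 1, SKIP = -1;

    // One step of a toy lexer: starting at pos[0], match either a run
    // of blanks (type SKIP) or a run of non-blanks (type INT).
    static int matchOne(String in, int[] pos) {
        if (pos[0] >= in.length()) return EOF;
        boolean blanks = in.charAt(pos[0]) == ' ';
        while (pos[0] < in.length()
               && (in.charAt(pos[0]) == ' ') == blanks) {
            pos[0]++;
        }
        return blanks ? SKIP : INT;
    }

    // The skip loop inside nextToken(): a SKIP token forces the lexer
    // to reset and try for another token; it never reaches the parser.
    static int nextToken(String in, int[] pos) {
        for (;;) {
            int ttype = matchOne(in, pos);
            if (ttype != SKIP) return ttype;
        }
    }
}
```

<p>
Calling <tt>nextToken</tt> on "&nbsp;&nbsp;42" skips the leading blanks and hands back only the INT token, which is exactly the behavior the parser observes.
</p>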

<h4><a id="Distinguishing_between_lexer_rules" name="Distinguishing_between_lexer_rules">Distinguishing between lexer rules</a></h4> 

<p>As with most lexer generators like <tt>lex</tt>, you simply list a
set of lexical rules that match tokens.  The tool then automatically
generates code to map the next input character(s) to a rule likely to
match.  Because ANTLR generates recursive-descent lexers just like it
does for parsers and tree parsers, ANTLR automatically generates a
method for a fictitious rule called <tt>nextToken</tt> that predicts
which of your lexer rules will match upon seeing the character
lookahead.  You can think of this method as just a big "switch" that
routes recognition flow to the appropriate rule (the code may be much
more complicated than a simple <tt>switch</tt>-statement, however).
Method <tt>nextToken</tt> is the only method of <tt>TokenStream</tt>
(in Java):

<tt><pre>
public interface TokenStream {
    public Token nextToken() throws TokenStreamException;
}
</pre></tt>

A parser feeds off a lookahead buffer and the buffer pulls from any
<tt>TokenStream</tt>.

Consider the following two ANTLR lexer rules:

<tt><pre>
INT : ('0'..'9')+;
WS : ' ' | '\t' | '\r' | '\n';
</pre></tt>

<p>
You will see something like the following method in a lexer generated by
ANTLR:

<tt><pre>
public Token nextToken() throws TokenStreamException {
    ...
    for (;;) {
        Token _token = null;
        int _ttype = Token.INVALID_TYPE;
        resetText();
        ...
        switch (LA(1)) {
          case '0': case '1': case '2': case '3':
          case '4': case '5': case '6': case '7':
          case '8': case '9':
            mINT(); break;
          case '\t': case '\n': case '\r': case ' ':
            mWS(); break;
          default: // error
        }
        ...
    }
}
</pre></tt>

<p> <b>What happens when the same character predicts more than a single
lexical rule</b>?  ANTLR generates a nondeterminism warning between the
offending rules, indicating you need to make sure your rules do not
have common left-prefixes.  ANTLR does not follow the common lexer
rule of &quot;first definition wins&quot; (the alternatives within a
rule, however, still follow this rule). Instead, sufficient power is
given to handle the two most common cases of ambiguity, namely
&quot;keywords vs. identifiers&quot;, and &quot;common prefixes&quot;;
and for especially nasty cases you can use syntactic or semantic
predicates.</p>

<p> <b>What if you want to break up the definition of a complicated
rule into multiple rules</b>? Surely you don't want every rule to
result in a complete Token object in this case. Some rules are only
around to help other rules construct tokens. To distinguish these
"helper" rules from rules that result in tokens, use the
<tt>protected</tt> modifier. This overloading of the access-visibility
Java term occurs because if the rule is not visible, it cannot be
"seen" by the parser (yes, this nomenclature sucks).  See also <a
href="http://www.jguru.com/faq/view.jsp?EID=125"><b>What is a
"protected" lexer rule</b></a>.

<p>
Another, more practical, way to look at this is to note that only
non-protected rules get called by <tt>nextToken</tt> and, hence, only
non-protected rules can generate tokens that get shoved down the
TokenStream pipe to the parser.


<h4><a id="Return_values" name="Return_values">Return values</a></h4> 
<p>
	All rules automatically return a token object (conceptually), which contains at least the text matched for the rule and its token type.&nbsp; To specify a user-defined return value, define a return value and set it in an action:
</p>
<pre>protected
INT returns [int v]
    :   ('0'..'9')+ { v=Integer.parseInt($getText); }
    ;</pre> 
<p>
Note that only protected rules can have a return type since regular lexer rules generally are invoked by <tt>nextToken()</tt> and the parser cannot access the return value, leading to confusion.
</p>

<h3><a id="Predicated-LL(k)_Lexing" name="Predicated-LL(k)_Lexing">Predicated-LL(k) Lexing</a></h3> 
<p>
	Lexer rules allow your parser to match <i>context-free</i> structures on the input character stream as opposed to the much weaker <i>regular</i> structures (using a DFA--deterministic finite automaton). For example, consider that matching nested curly braces with a DFA must be done using a counter whereas nested curlies are trivially matched with a context-free grammar: 
</p>
<pre><tt>ACTION
    :   '{' ( ACTION | ~'}' )* '}'
    ;</tt>    </pre> 
<p>
The recursion from rule ACTION to ACTION, of course, is the dead giveaway that this is not an ordinary lexer rule. 
</p>
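<p>
A plain-Java rendering of the idea (a hypothetical helper, not ANTLR output) makes the role of recursion concrete: each nested '{' triggers a recursive call, which is exactly what a finite automaton cannot do without an auxiliary counter.
</p>

```java
public class NestedAction {
    // Returns the index just past the '}' matching the '{' at `start`,
    // or -1 if the input is not a balanced action block.
    static int matchAction(String in, int start) {
        if (start >= in.length() || in.charAt(start) != '{') return -1;
        int i = start + 1;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '{') {            // nested ACTION: recurse
                i = matchAction(in, i);
                if (i < 0) return -1;
            } else if (c == '}') {
                return i + 1;          // close the current ACTION
            } else {
                i++;                   // ~'}' : any other character
            }
        }
        return -1;                     // ran out of input: unbalanced
    }
}
```

<p>
For example, <tt>matchAction("{a{b}c}", 0)</tt> returns 7, the index just past the outermost closing brace.
</p>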
<p>
	Because the same algorithms are used to analyze lexer and parser rules, lexer rules may use more than a single symbol of lookahead, can use semantic predicates, and can specify syntactic predicates to look arbitrarily ahead, thus providing recognition capabilities beyond the LL(k) languages into the <i>context-sensitive</i>. Here is a simple example that requires k&gt;1 lookahead: 
</p>
<pre><tt>ESCAPE_CHAR
    :   '\\' 't' // two chars of lookahead needed,
    |   '\\' 'n' // due to common left-prefix
    ;</tt>    </pre> 
<p>
	To illustrate the use of syntactic predicates for lexer rules, consider the problem of distinguishing between floating point numbers and ranges in Pascal. Input <tt>3..4</tt> must be broken up into 3 tokens: <tt>INT</tt>, <tt>RANGE</tt>, followed by <tt>INT</tt>. Input <tt>3.4</tt>, on the other hand, must be sent to the parser as a <tt>REAL</tt>. The trouble is that the series of digits before the first <tt>'.'</tt> can be arbitrarily long. The scanner then must consume the first <tt>'.'</tt> to see if the next character is a <tt>'.'</tt>, which would imply that it must back up and consider the first series of digits an integer. Using a non-backtracking lexer makes this task very difficult; without backtracking, your lexer has to be able to respond with more than a single token at one time. However, a syntactic predicate can be used to specify what arbitrary lookahead is necessary: 
</p>
<pre><tt>class Pascal extends Parser;

prog:   INT
        (   RANGE INT
            { System.out.println(&quot;INT .. INT&quot;); }
        |   EOF
            { System.out.println(&quot;plain old INT&quot;); }
        )
    |   REAL { System.out.println(&quot;token REAL&quot;); }
    ;

class LexPascal extends Lexer;

WS  :   (' '
    |   '\t'
    |   '\n'
    |   '\r')+
        { $setType(Token.SKIP); }
    ;

protected
INT :   ('0'..'9')+
    ;

protected
REAL:   INT '.' INT
    ;

RANGE
    :   &quot;..&quot;
    ;

RANGE_OR_INT
    :   ( INT &quot;..&quot; ) =&gt; INT  { $setType(INT); }
    |   ( INT '.' )  =&gt; REAL { $setType(REAL); }
    |   INT                  { $setType(INT); }
    ;</tt>    </pre> 
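<p>
The effect of the <tt>( INT &quot;..&quot; ) =&gt;</tt> predicate can be modeled in ordinary Java. The sketch below uses assumed names and no real mark/rewind machinery: it scans the digits, peeks at what follows, and commits to whichever alternative the guess selects, which is all the predicate's speculative match and rewind accomplish here.
</p>

```java
public class RangeOrInt {
    // Returns the token type of the *first* token in `in`.
    static String classify(String in) {
        int i = 0;
        while (i < in.length() && Character.isDigit(in.charAt(i))) i++;
        // ( INT ".." ) => : the guess scans "3.." speculatively, then
        // rewinds; only the digits are kept, as a plain INT.
        if (in.startsWith("..", i)) return "INT";
        // ( INT '.' ) => : a single dot means a REAL such as 3.4
        if (in.startsWith(".", i)) return "REAL";
        return "INT";
    }
}
```

<p>
So "3..4" yields an INT first (leaving ".." for the RANGE rule), while "3.4" yields a REAL.
</p>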
<p>
	ANTLR lexer rules are even able to handle FORTRAN assignments and other difficult lexical constructs. Consider the following <tt>DO</tt> loop: 
</p>
<pre><tt>DO 100 I = 1,10</tt></pre> 
<p>
	If the comma were replaced with a period, the loop would become an assignment to a weird variable called &quot;<tt>DO100I</tt>&quot;: 
</p>
<pre><tt>DO 100 I = 1.10</tt></pre> 
<p>
	The following rules correctly differentiate the two cases:
</p>
<pre>DO_OR_VAR
    :   (DO_HEADER)=&gt; &quot;DO&quot; { $setType(DO); }
    |   VARIABLE { $setType(VARIABLE); }
    ;

protected
DO_HEADER
options { ignore=WS; }
    :   &quot;DO&quot; INT VARIABLE '=' EXPR ','
    ;

protected INT : ('0'..'9')+;

protected WS : ' ';

protected
VARIABLE
    :   'A'..'Z'
        ('A'..'Z' | ' ' | '0'..'9')*
        { /* strip space from end */ }
    ;

// just an int or float
protected EXPR
    :   INT ( '.' (INT)? )?
    ;
</pre>

<p> The previous examples discuss differentiating lexical rules via
lots of lookahead (fixed k or arbitrary).  There are other situations
where you have to turn on and off certain lexical rules (making
certain tokens valid and invalid) depending on prior context or
semantic information.  One of the best examples is matching a token
only if it starts on the left edge of a line (i.e., column 1).
Without being able to test the state of the lexer's column counter,
you cannot do a decent job.  Here is a simple <tt>DEFINE</tt> rule
that is only matched if the semantic predicate is true.

<tt><pre>
DEFINE
    :   {getColumn()==1}? "#define" ID
    ;
</pre></tt>

<p> Semantic predicates on the <b>left-edge</b> of
<b>single-alternative</b> lexical rules get hoisted into the
<tt>nextToken</tt> prediction mechanism.  Adding the predicate to a
rule makes it so that it is not a candidate for recognition until the
predicate evaluates to true.  In this case, the method for
<tt>DEFINE</tt> would never be entered, even if the lookahead
predicted <tt>#define</tt>, if the column &gt; 1.
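<p>
The predicate's behavior can be mimicked with an explicit column computation. The sketch below is illustrative only (the real bookkeeping lives inside ANTLR's lexer base class; the names here are hypothetical):
</p>

```java
public class ColumnGate {
    // getColumn() analogue: 1-based column of offset `i` in `text`.
    static int column(String text, int i) {
        int col = 1;
        for (int j = 0; j < i; j++) {
            col = (text.charAt(j) == '\n') ? 1 : col + 1;
        }
        return col;
    }

    // The {getColumn()==1}? predicate: DEFINE is only a candidate when
    // the lookahead sits on the left edge of a line.
    static boolean definePredicts(String text, int i) {
        return column(text, i) == 1 && text.startsWith("#define", i);
    }
}
```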

<p> Another useful example involves context-sensitive recognition such
as when you want to match a token only if your lexer is in a particular
context (e.g., the lexer previously matched some trigger sequence).  If
you are matching tokens that separate rows of data such as
"<tt>----</tt>", you probably only want to match this if the "begin
table" sequence has been found.

<tt><pre>
BEGIN_TABLE
    :   '[' {this.inTable=true;} // enter table context
    ;

ROW_SEP
    :   {this.inTable}? "----"
    ;

END_TABLE
    :   ']' {this.inTable=false;} // exit table context
    ;
</pre></tt>

This predicate hoisting ability is another way to simulate lexical
states from DFA-based lexer generators like <tt>lex</tt>, though
predicates are much more powerful.  (You could even turn on certain
rules according to the phase of the moon). ;)

<h3><a id="Keywords_and_literals" name="Keywords_and_literals">Keywords and literals</a></h3> 
<p>
	Many languages have a general &quot;identifier&quot; lexical rule, and keywords that are special cases of the identifier pattern. A typical identifier token is defined as: 
</p>
<pre><tt>ID : LETTER (LETTER | DIGIT)*;</tt></pre> 
<p>
	This is often in conflict with keywords. ANTLR solves this problem by letting you put fixed keywords into a literals table. The literals table (which is usually implemented as a hash table in the lexer) is checked after each token is matched, so that the literals effectively override the more general identifier pattern. Literals are created in one of two ways. First, any double-quoted string used in a parser is automatically entered into the literals table of the associated lexer. Second, literals may be specified in the lexer grammar by means of the <a href="options.html#literal">literal option</a>. In addition, the <a href="options.html#testLiterals">testLiterals option</a> gives you fine-grained control over the generation of literal-testing code. 
</p>
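<p>
The literals-table check can be sketched as follows. This is a simplified model in which the table contents and token-type constants are made up for illustration: after the general ID rule matches, the matched text is looked up and, on a hit, the keyword's token type overrides ID.
</p>

```java
import java.util.HashMap;
import java.util.Map;

public class LiteralsTable {
    static final int ID = 10, LITERAL_if = 20, LITERAL_while = 21;

    // Assumed, simplified: in a real ANTLR lexer this table is filled
    // from double-quoted strings used in the parser grammar.
    static final Map<String, Integer> literals = new HashMap<>();
    static {
        literals.put("if", LITERAL_if);
        literals.put("while", LITERAL_while);
    }

    // Checked after the general ID rule matches: a hit in the table
    // overrides the token type, so keywords win over identifiers.
    static int testLiteralsTable(String text, int ttype) {
        return literals.getOrDefault(text, ttype);
    }
}
```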
<h3><a id="Common_prefixes" name="Common_prefixes">Common prefixes</a></h3> 
<p>
	Fixed-length common prefixes in lexer rules are best handled by increasing the <a href="options.html#k">lookahead depth</a> of the lexer. For example, some operators from Java: 
</p>
<pre><tt>class MyLexer extends Lexer;
options {
  k=4;
}
GT : &quot;&gt;&quot;;
GE : &quot;&gt;=&quot;;
RSHIFT : &quot;&gt;&gt;&quot;;
RSHIFT_ASSIGN : &quot;&gt;&gt;=&quot;;
UNSIGNED_RSHIFT : &quot;&gt;&gt;&gt;&quot;;
UNSIGNED_RSHIFT_ASSIGN : &quot;&gt;&gt;&gt;=&quot;;</tt></pre>
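<p>
Why k=4 suffices here can be seen with a longest-match sketch in plain Java (a hypothetical helper, not generated code): four characters of lookahead let the lexer commit to the longest operator immediately, with no backtracking.
</p>

```java
public class MaxMunch {
    // Ordered longest-first; the longest entry is 4 characters, so
    // k=4 lookahead is enough to pick the right one up front.
    static final String[] OPS = { ">>>=", ">>>", ">>=", ">>", ">=", ">" };

    static String matchOp(String in) {
        for (String op : OPS) {
            if (in.startsWith(op)) return op;
        }
        return null;                   // not an operator
    }
}
```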

 <h3><a id="Token_definition_files" name="Token_definition_files">Token definition files</a></h3> 
<p>
	Token definitions can be transferred from one grammar to another by way of token definition files. This is accomplished using the <a href="options.html#importVocab">importVocab</a> and <a href="options.html#exportVocab">exportVocab</a> options.
</p>
<h3><a id="Character_classes" name="Character_classes">Character classes</a></h3> 
<p>
	Use the <font face="Courier New">~</font> operator to invert a character or set of characters.&nbsp; For example, to match any character other than newline, the following rule references ~'\n'.
</p>
<pre>SL_COMMENT: &quot;//&quot; (~'\n')* '\n';</pre> 
<p>
	The <font face="Courier New">~</font> operator also inverts a character set:
</p>
<pre>NOT_WS: ~(' ' | '\t' | '\n' | '\r');</pre> 
<p>
	The range operator can be used to create sequential character sets:
</p>
<pre>DIGIT : '0'..'9' ;</pre>

<h3><a id="Token_Attributes" name="Token_Attributes">Token Attributes</a></h3> 
<p>
	See the next section.
</p>
<h3><a name="lexicallookahead">Lexical lookahead and the end-of-token symbol</a></h3> 
<p>
	A unique situation occurs when analyzing lexical grammars, one which is similar to the end-of-file condition when analyzing regular grammars.&nbsp; Consider how you would compute lookahead sets for the ('b' | ) subrule in the following rule B:
</p>
<pre>class L extends Lexer;

A	:	B 'b'
	;

protected  // only called from another lex rule
B	:	'x' ('b' | )
	;</pre> 
<p>
	The lookahead for the first alternative of the subrule is clearly 'b'.&nbsp; The second alternative is empty and the lookahead set is the set of all characters that can follow references to the subrule, which is the follow set for rule B.&nbsp; In this case, the 'b' character follows the reference to B and is therefore the lookahead set for the empty alt indirectly.&nbsp; Because 'b' begins both alternatives, the parsing decision for the subrule is nondeterministic, or ambiguous as we sometimes say.&nbsp; ANTLR will justly generate a warning for this subrule (unless you use the <font face="Courier New">warnWhenFollowAmbig</font> option).
</p>
<p>
	Now, consider what would make sense for the lookahead if rule A did not exist and rule B was not protected (it was a complete token rather than a &quot;subtoken&quot;):
</p>
<pre>B	:	'x' ('b' | )
	;</pre> 
<p>
	In this case, the empty alternative finds only the end of the rule as the lookahead with no other rules referencing it.&nbsp; In the worst case, <strong>any</strong> character could follow this rule (i.e., start the next token or error sequence).&nbsp; So, should not the lookahead for the empty alternative be the entire character vocabulary? &nbsp; And should not this result in a nondeterminism warning as it must conflict with the 'b' alternative?&nbsp; Conceptually, yes to both questions.&nbsp; From a practical standpoint, however, you are clearly saying &quot;hey, match a 'b' on the end of token B if you find one.&quot;&nbsp; I argue that no warning should be generated and ANTLR's policy of matching elements as soon as possible makes sense here as well.
</p>
<p>
	Another reason not to represent the lookahead as the entire vocabulary is that a vocabulary of '\u0000'..'\uFFFF' is really big (one set is 2^16 / 32 long words of memory!).&nbsp; Any alternative with '&lt;end-of-token&gt;' in its lookahead set will be pushed to the ELSE or DEFAULT clause by the code generator so that huge bitsets can be avoided.
</p>
<p>
	The summary is that lookahead purely derived from hitting the end of a lexical rule (unreferenced by other rules) cannot be the cause of a nondeterminism.&nbsp; The following table summarizes a bunch of cases that will help you figure out when ANTLR will complain and when it will not.
</p>
<table border="1" width="100%">
	<tr>
		<td valign="top">
<pre>X	:	'q' ('a')? ('a')?
        ;</pre> 
		</td>
		<td width="100">
			The first subrule is nondeterministic, as 'a' from the second subrule (and end-of-token) is in the lookahead for the exit branch of (...)?
		</td>
	</tr>
	<tr>
		<td valign="top">
<pre>X	:	'q' ('a')? ('c')?
        ;</pre> 
		</td>
		<td width="100">
			No nondeterminism.
		</td>
	</tr>
	<tr>
		<td valign="top">
<pre>Y	:    'y' X 'b'
	;

protected
X	:    'b'
	|
	;</pre> 
		</td>
		<td width="100">
			Nondeterminism in rule X.
		</td>
	</tr>
	<tr>
		<td valign="top">
<pre>X	:	'x' ('a'|'c'|'d')+
	|	'z' ('a')+
	;</pre> 
		</td>
		<td width="100">
			No nondeterminism, as the exit branches of the loops see lookahead computed purely from end-of-token.
		</td>
	</tr>
	<tr>
		<td valign="top">
<pre>Y	:	'y' ('a')+ ('a')?
	;</pre> 
		</td>
		<td width="100">
			Nondeterminism between 'a' of (...)+ and exit branch as the exit can see the 'a' of the optional subrule.&nbsp; This would be a problem even if ('a')? were simply 'a'.&nbsp; A (...)* loop would report the same problem.
		</td>
	</tr>
	<tr>
		<td valign="top">
<pre>X	:	'y' ('a' 'b')+ 'a' 'c'
	;</pre> 
		</td>
		<td width="100">
			At k=1, this is a nondeterminism for the (...)+ since 'a' predicts staying in and exiting the loop.&nbsp; At k=2, no nondeterminism.
		</td>
	</tr>
	<tr>
		<td valign="top">
<pre>Q	:	'q' ('a' | )?
	;</pre> 
		</td>
		<td width="100">
			Here, there is an empty alternative inside an optional subrule.&nbsp; A nondeterminism is reported as two paths predict end-of-token.
		</td>
	</tr>
</table>
<p>
	You might be wondering why the first subrule below is ambiguous:
</p>
<pre>('a')? ('a')?</pre> 
<p>
	The answer is that the NFA to DFA conversion would result in a DFA with the 'a' transitions merged into a single state transition!&nbsp; This is ok for a DFA where you cannot have actions anywhere except after a complete match.&nbsp; Remember that ANTLR lets you do the following:
</p>
<pre>('a' {do-this})? ('a' {do-that})?</pre> 
<p>
	One other thing is important to know.&nbsp; Recall that alternatives in lexical rules are reordered according to their lookahead requirements, from highest to lowest.
</p>
<pre>A	:	'a'
	|	'a' 'b'
	;</pre> 
<p>
	At k=2, ANTLR can see 'a' followed by '&lt;end-of-token&gt;' for the first alternative and 'a' followed by 'b' in the second.&nbsp; Because the lookahead at depth 2 for the first alternative is '&lt;end-of-token&gt;', the warning that depth two can match any character for the first alternative is suppressed.&nbsp; To behave naturally and to generate good code when no warning is generated, ANTLR reorders the alternatives so that the code generated is similar to:
</p>
<pre>A() {
	if ( LA(1)=='a' &amp;&amp; LA(2)=='b' ) { // alt 2
		match('a'); match('b');
	}
	else if ( LA(1)=='a' ) { // alt 1
		match('a');
	}
	else {<em>error</em>;}
}</pre> 
<p>
	Note the lack of lookahead test for depth 2 for alternative 1.&nbsp; When an empty alternative is present, ANTLR moves it to the end.&nbsp; For example,
</p>
<pre>A	:	'a'
	|
	|	'a' 'b'
	;</pre> 
<p>
	results in code like this:
</p>
<pre>A() {
	if ( LA(1)=='a' &amp;&amp; LA(2)=='b' ) { // alt 2
		match('a'); match('b');
	}
	else if ( LA(1)=='a' ) { // alt 1
		match('a');
	}
	else {
	}
}</pre> 
<p>
	Note that there is no way for a lexing error to occur here (which makes sense because the rule is optional--though this rule only makes sense when <font face="Courier New">protected</font>).
</p>
<p>
	Semantic predicates get moved along with their associated alternatives when the alternatives are sorted by lookahead depth.&nbsp; It would be weird if the addition of a {true}? predicate (which implicitly exists for each alternative) changed what the lexer recognized!&nbsp; The following rule is reordered so that alternative 2 is tested first.
</p>
<pre>B	:	{true}? 'a'
	|	'a' 'b'
	;</pre> 
<p>
	Syntactic predicates are <strong>not</strong> reordered.&nbsp; Mentioning the predicate after the rule it conflicts with results in an ambiguity, as in this rule:
</p>
<pre>F	:	'c'
	|	('c')=&gt; 'c'
	;</pre> 
<p>
	Other alternatives are, however, reordered with respect to the syntactic predicates, even when a switch is generated for the LL(1) components and the syntactic predicates are pushed into the default case.&nbsp; The following rule illustrates the point.
</p>
<pre>F	:	'b'
	|	{/* empty-path */}
	|	('c')=&gt; 'c'
	|	'c'
	|	'd'
	|	'e'
	;</pre> 
<p>
	Rule F's decision is generated as follows:
</p>
<pre>        switch ( la_1) {
        case 'b':
        {
            match('b');
            break;
        }
        case 'd':
        {
            match('d');
            break;
        }
        case 'e':
        {
            match('e');
            break;
        }
        default:
            boolean synPredMatched15 = false;
            if (((la_1=='c'))) {
                int _m15 = mark();
                synPredMatched15 = true;
                guessing++;
                try {
                    match('c');
                }
                catch (RecognitionException pe) {
                    synPredMatched15 = false;
                }
                rewind(_m15);
                guessing--;
            }
            if ( synPredMatched15 ) {
                match('c');
            }
            else if ((la_1=='c')) {
                match('c');
            }
            else {
                if ( guessing==0 ) {
                    /* empty-path */
                }
            }
        }</pre> 
<p>
	Notice how the empty path got moved after the test for the 'c' alternative.
</p>
<h3><a id="Scanning_Binary_Files" name="Scanning_Binary_Files"></a><a name="Scanning Binary Files">Scanning Binary Files</a></h3> 
<p>
	Character literals are not limited to printable ASCII characters.&nbsp; To demonstrate the concept, imagine that you want to parse a binary file that contains strings and short integers.&nbsp; To distinguish between them, marker bytes are used according to the following format:
</p>
<table border="1" width="100%">
	<tr>
		<th width="50%">
			format
		</th>
		<th width="50%" align="center">
			description
		</th>
	</tr>
	<tr>
		<td width="50%">
			'\0' <em>highbyte lowbyte</em>
		</td>
		<td width="50%" align="center">
			Short integer
		</td>
	</tr>
	<tr>
		<td width="50%">
			'\1' <em>string of non-'\2' chars</em> '\2'
		</td>
		<td width="50%" align="center">
			String
		</td>
	</tr>
</table>
<p>
	Sample input (274 followed by &quot;a test&quot;) might look like the following in hex (output from UNIX <strong>od</strong> <strong>-h</strong> command):
</p>
<pre>0000000000    00 01 12 01 61 20 74 65 73 74 02 </pre> 
<p>
	or as viewed as characters:
</p>
<pre>0000000000    \0 001 022 001 a      t  e  s  t 002</pre> 
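<p>
	To produce such input for testing, you can write the marker bytes directly.&nbsp; The following sketch (our own helper, not part of the original example; the class name MakeSample is arbitrary) emits 274 as a marked short followed by the marked string &quot;a test&quot;:
</p>

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Builds the sample input: short 274, then string "a test",
// using the marker bytes from the format table above.
public class MakeSample {
    public static byte[] sample() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(0);          // '\0' marker: a short follows
            out.writeShort(274);       // high byte 0x01, low byte 0x12
            out.writeByte(1);          // '\1' marker: string begins
            out.writeBytes("a test");  // string contents
            out.writeByte(2);          // '\2' marker: string ends
            return buf.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for an in-memory stream
        }
    }
    public static void main(String[] args) {
        for (byte b : sample()) System.out.printf("%02x ", b);
        System.out.println();
    }
}
```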
<p>
	The parser is trivially just a (...)+ around the two types of input tokens:
</p>
<pre>class DataParser extends Parser;

file:   (   sh:SHORT
            {System.out.println(sh.getText());}
        |   st:STRING
            {System.out.println(&quot;\&quot;&quot;+
               st.getText()+&quot;\&quot;&quot;);}
        )+
    ;</pre> 
<p>
	All of the interesting stuff happens in the lexer.&nbsp; First, define the class and set the vocabulary to be all 8-bit binary values:
</p>
<pre>class DataLexer extends Lexer;
options {
    charVocabulary = '\u0000'..'\u00FF';
}</pre> 
<p>
	Then, define the two tokens according to the specifications, with markers around the string and a single marker byte in front of the short:
</p>
<pre>SHORT
    :   // match the marker followed by any 2 bytes
        '\0' high:. lo:.
        {
        // pack the bytes into a two-byte short
        int v = (((int)high)&lt;&lt;8) + lo;
        // make a string out of the value
        $setText(&quot;&quot;+v);
        }
    ;

STRING
    :   '\1'!   // begin string (discard)
        ( ~'\2' )*
        '\2'!   // end string (discard)
    ;</pre> 
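<p>
	The expression in the SHORT action simply reassembles the big-endian value from its two bytes.&nbsp; Here is a minimal sketch of that arithmetic in isolation (the PackShort class is our own illustration, not ANTLR code):
</p>

```java
// Reassembles a big-endian short from the two bytes that follow
// the '\0' marker, exactly as the SHORT rule's action does.
public class PackShort {
    public static int pack(char high, char lo) {
        return (((int) high) << 8) + lo; // same expression as in SHORT
    }
    public static void main(String[] args) {
        // bytes 0x01 0x12 from the sample input encode 274
        System.out.println(pack('\u0001', '\u0012')); // prints 274
    }
}
```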
<p>
	To invoke the parser, use something like the following:
</p>
<pre>import java.io.*;

class Main {
    public static void main(String[] args) {
        try {
            // use DataInputStream to grab bytes
            DataLexer lexer =
              new DataLexer(
                new DataInputStream(System.in)
              );
            DataParser parser =
                new DataParser(lexer);
            parser.file();
        } catch(Exception e) {
            System.err.println(&quot;exception: &quot;+e);
        }
    }
}</pre>

<h3><a name="unicode"></a>Scanning Unicode Characters</h3> 
<p>
	ANTLR (as of 2.7.1) allows you to recognize input composed of Unicode characters; that is, you are not restricted to 8-bit ASCII characters.&nbsp; I would like to emphasize that ANTLR <em>allows</em>, but does not yet <em>support</em>, Unicode, as there is more work to be done.&nbsp; For example, end-of-file is currently incorrectly specified:
</p>
<pre>CharScanner.EOF_CHAR=(char)-1;</pre> 
<p>
	This must be an integer -1, not a char; the cast actually
	narrows the value to 0xFFFF.&nbsp; I have to go through
	the entire code base looking for these problems.&nbsp; Plus,
	we should really have a special syntax to mean &quot;Java
	identifier character&quot; and some standard encodings for
	non-Western character sets, etc.&nbsp; I expect 2.7.3 to add nice
	predefined character blocks like <tt>LETTER</tt>.
</p>
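<p>
	You can see the narrowing problem directly: casting -1 to char yields '\uFFFF', a legitimate character in a 16-bit vocabulary, so it cannot safely serve as an end-of-file sentinel.&nbsp; A small demonstration (our own, not ANTLR code):
</p>

```java
// Shows why (char)-1 is a poor EOF sentinel for Unicode input:
// the cast narrows -1 to the valid character value 0xFFFF.
public class EofNarrowing {
    public static void main(String[] args) {
        char eofChar = (char) -1;
        System.out.println((int) eofChar);   // prints 65535, not -1
        // An int-valued sentinel stays distinct from every 16-bit char.
        int eofInt = -1;
        System.out.println(eofInt < 0);      // prints true
    }
}
```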
<p>
The following is a very simple example of how to match a series of space-separated identifiers.
</p>
<pre>class L extends Lexer;

options {
    // Allow any char but \uFFFF (16 bit -1)
    charVocabulary='\u0000'..'\uFFFE';
}

{
    private static boolean done = false;

    public void uponEOF()
        throws TokenStreamException, CharStreamException
    {
        done=true;
    }
    
    public static void main(String[] args) throws Exception {
        L lexer = new L(System.in);
        while ( !done ) {
            Token t = lexer.nextToken();
            System.out.println(&quot;Token: &quot;+t);
        }
    }
}

ID    :    ID_START_LETTER ( ID_LETTER )*
    ;

WS    :    (' '|'\n') {$setType(Token.SKIP);}
    ;

protected
ID_START_LETTER
    :    '$'
    |    '_'
    |    'a'..'z'
    |    '\u0080'..'\ufffe'
    ;

protected
ID_LETTER
    :    ID_START_LETTER
    |    '0'..'9'
    ;</pre> 
<p>
	A final note on Unicode.&nbsp; The ~<em>x</em> &quot;not&quot; operator includes everything in your specified vocabulary (up to 16 bit character space) except <em>x</em>. &nbsp; For example,
</p>
<pre>~('$'|'a'..'z')</pre> 
<p>
	results in every unicode character except '$' and lowercase latin-1 letters, assuming your charVocabulary is 0..FFFF.
</p>
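<p>
	The inverted set is easy to model as a membership predicate over the vocabulary.&nbsp; This sketch (the NotSet class is our own illustration) mirrors ~('$'|'a'..'z'):
</p>

```java
// Membership test equivalent to the inverted character set
// ~('$'|'a'..'z') over a 0..FFFF vocabulary.
public class NotSet {
    public static boolean matches(char c) {
        return !(c == '$' || (c >= 'a' && c <= 'z'));
    }
    public static void main(String[] args) {
        System.out.println(matches('A'));      // true: uppercase is not excluded
        System.out.println(matches('q'));      // false: lowercase is excluded
        System.out.println(matches('\u00E9')); // true: non-ASCII letters remain
    }
}
```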
<h3><a id="Manipulating_Token_Text_and_Objects" name="Manipulating_Token_Text_and_Objects"></a><a name="Manipulating Token Text and Objects">Manipulating Token Text and Objects</a></h3> 
<p>
	Once you have specified what to match in a lexical rule, you may ask &quot;what can I discover about what will be matched for each rule element?&quot;&nbsp; ANTLR allows you to label the various elements and, at parse-time, access the text matched for the element. &nbsp; You can even specify the token object to return from the rule and, hence, from the lexer to the parser.&nbsp; This section describes the text and token object handling characteristics of ANTLR.
</p>
<h4><a id="Manipulating_the_Text_of_a_Lexical_Rule" name="Manipulating_the_Text_of_a_Lexical_Rule"></a><a name="_bb14">Manipulating the Text of a Lexical Rule</a></h4> 
<p>
	There are times when you want to look at the text matched for the current rule, alter it, or set the text of a rule to a new string.&nbsp; The most common case is when you want to simply discard the text associated with a few of the elements that are matched for a rule such as quotes.
</p>
<p>
	ANTLR provides the '!' operator that lets you indicate certain elements should not contribute to the text for a token being recognized. The '!' operator is used just like when building trees in the parser. For example, if you are matching the HTML tags and you do not want the '&lt;' and '&gt;' characters returned as part of the token text, you could manually remove them from the token's text before they are returned, but a better way is to suffix the unwanted characters with '!'. For example, the &lt;br&gt; tag might be recognized as follows:
</p>
<pre>BR  :  '&lt;'! &quot;br&quot; '&gt;'! ;	// discard &lt; and &gt;</pre> 
<p>
	Suffixing a lexical rule reference with '!' forces the text matched by the invoked rule to be discarded (it will not appear in the text for the invoking rule).&nbsp; For example, if you do not care about the mantissa of a floating point number, you can suffix the rule that matches it with a '!':
</p>
<pre>FLOAT : INT ('.'! INT!)? ; // keep only first INT</pre> 
<p>
	As a shorthand notation, you may suffix an alternative or rule with '!' to indicate the alternative or rule should not pass any text back to the invoking rule or parser (if nonprotected):
</p>
<pre>// ! on rule: nothing is auto added to text of rule.
rule! : ... ;

// ! on alt: nothing is auto added to text for alt
rule : ... |! ...;</pre> 
<table border="1">
	<tr>
		<th width="175">
			Item suffixed with '!'
		</th>
		<th>
			Effect
		</th>
	</tr>
	<tr>
		<td width="175" align="center">
			char or string literal
		</td>
		<td align="left">
			Do not add text for this atom to current rule's text.
		</td>
	</tr>
	<tr>
		<td width="175" align="center">
			rule reference
		</td>
		<td align="left">
			Do not add the text matched while recognizing this rule to the current rule's text.
		</td>
	</tr>
	<tr>
		<td width="175" align="center">
			alternative
		</td>
		<td align="left">
			Nothing that is matched by alternative is added to current rule's text; the enclosing rule contributes nothing to any invoking rule's text.&nbsp; For nonprotected rules, the text for the token returned to parser is blank.
		</td>
	</tr>
	<tr>
		<td width="175" align="center">
			rule definition
		</td>
		<td align="left">
			Nothing that is matched by <strong>any</strong> alternative is added to current rule's text; the rule contributes nothing to any invoking rule's text.&nbsp; For nonprotected rules, the text for the token returned to parser is blank.
		</td>
	</tr>
</table>
<p>
	While the '!' implies that the text is not added to the text for the current rule, you can label an element to access the text (via the token if the element is a rule reference).
</p>
<p>
	In terms of implementation, the characters are always added to the current text buffer, but are carved out when necessary (as this will be the exception rather than the rule, making the normal case efficient).
</p>
<p>
	The '!' operator is great for discarding certain characters or groups of characters, but what about the case where you want to insert characters or totally reset the text for a rule or token?&nbsp; ANTLR provides a series of special methods to do this (we prefix the methods with '$' because Java does not have a macro facility and ANTLR must recognize the special methods in your actions).&nbsp; The following table summarizes.
</p>
<table border="1">
	<tr>
		<th width="175">
			Method
		</th>
		<th>
			Description/Translation
		</th>
	</tr>
	<tr>
		<td align="center" width="175">
			<font face="Courier New">$append(x)</font>
		</td>
		<td>
			Append x to the text of the surrounding rule.&nbsp; Translation: <font face="Courier New">text.append(x)</font>
		</td>
	</tr>
	<tr>
		<td align="center" width="175">
			<font face="Courier New">$setText(x)</font>
		</td>
		<td>
			Set the text of the surrounding rule to x.&nbsp; Translation: <font face="Courier New">text.setLength(_begin); text.append(x)</font>
		</td>
	</tr>
	<tr>
		<td align="center" width="175">
			<font face="Courier New">$getText</font>
		</td>
		<td>
			Return a String of the text for the surrounding rule.&nbsp; Translation:
			<br>
			<font face="Courier New">new String(text.getBuffer(),
				<br>
				_begin,text.length()-_begin)</font>
		</td>
	</tr>
	<tr>
		<td align="center" width="175">
			<font face="Courier New">$setToken(x)</font>
		</td>
		<td>
			Set the token object that this rule is to return.&nbsp; See the section on <a href="#Token Object Creation">Token Object Creation</a>. Translation: <font face="Courier New">_token = x</font>
		</td>
	</tr>
	<tr>
		<td align="center" width="175">
			<font face="Courier New">$setType(x)</font>
		</td>
		<td>
			Set the token type of the surrounding rule.&nbsp; Translation: <font face="Courier New">_ttype = x</font>
		</td>
	</tr>
	<tr>
		<td align="center" width="175">
			<font face="Courier New">setText(x)</font>
		</td>
		<td>
			<font face="Times New Roman">Set the text for the entire token being recognized </font>regardless of what rule the action is in<font face="Times New Roman">. No translation.</font>
		</td>
	</tr>
	<tr>
		<td align="center" width="175">
			<font face="Courier New">getText()</font>
		</td>
		<td>
			<font face="Times New Roman">Get the text for the entire token being recognized </font>regardless of what rule the action is <font face="Times New Roman">in. No translation.</font>
		</td>
	</tr>
</table>
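<p>
	The translations in the table amount to simple StringBuffer manipulation against a shared text buffer and a per-rule _begin offset.&nbsp; The following standalone sketch models that mechanism (the TextBuffer class is our own; real generated lexers keep this state in the CharScanner):
</p>

```java
// Models how generated lexer code implements $append, $setText, and
// $getText with a shared text buffer and a per-rule _begin offset.
public class TextBuffer {
    StringBuffer text = new StringBuffer();
    int _begin = 0; // index where the current rule's text starts

    void append(String x) {          // $append(x)
        text.append(x);
    }
    void setText(String x) {         // $setText(x)
        text.setLength(_begin);
        text.append(x);
    }
    String getText() {               // $getText
        return text.substring(_begin);
    }
    public static void main(String[] args) {
        TextBuffer b = new TextBuffer();
        b.append("\"");              // opening quote already matched
        b._begin = b.text.length();  // a nested rule starts here
        b.append("n");               // matched the escape character
        b.setText("\n");             // replace the rule's text
        System.out.println(b.getText().equals("\n")); // prints true
    }
}
```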
<p>
	One of the great things about an ANTLR generated lexer is that the text of a token can be modified incrementally as the token is recognized (an impossible task for a DFA-based lexer):
</p>
<pre>STRING: '&quot;' ( ESCAPE | ~('&quot;'|'\\') )* '&quot;' ;

protected
ESCAPE
    :    '\\'
         ( 'n' { $setText(&quot;\n&quot;); }
         | 'r' { $setText(&quot;\r&quot;); }
         | 't' { $setText(&quot;\t&quot;); }
         | '&quot;' { $setText(&quot;\&quot;&quot;); }
         )
    ;</pre> <h4><a name="_bb15"></a><a name="Token Object Creation">Token Object Creation</a></h4> 
<p>
	Because lexical rules can call other rules just like in the parser, you sometimes want to know what text was matched for that portion of the token being matched. To support this, ANTLR allows you to label lexical rules and obtain a <font face="Courier New">Token</font> object representing the text, token type, line number, etc... matched for that rule reference.&nbsp;&nbsp; This ability corresponds to be able to access the text matched for a lexical state in a DFA-based lexer.&nbsp; For example, here is a simple rule that prints out the text matched for a rule reference, INT.
</p>
<pre>INDEX	:	'[' i:INT ']'
		{System.out.println(i.getText());}
	;</pre> <pre>INT	:	('0'..'9')+ ;</pre> 
<p>
	If you moved the labeled reference and action to a parser, it would do the same thing (match an integer and print it out).
</p>
<p>
	All lexical rules <em>conceptually</em> return a <font face="Courier New">Token</font> object, but in practice this would be inefficient. ANTLR generates methods so that a token object is created only if any invoking reference is labeled (indicating they want the token object).&nbsp; Imagine another rule that calls INT without a label.
</p>
<pre>FLOAT	:	INT ('.' INT)? ;</pre>
<p>
In this case, no token object is created for either reference to INT.&nbsp; You will notice a boolean argument to every lexical rule that tells it whether or not a token object should be created and returned (via a member variable).&nbsp; All nonprotected rules (those that are &quot;exposed&quot; to the parser) must always generate tokens, which are passed back to the parser.
</p>
<h4><a id="Heterogeneous_Token_Object_Streams" name="Heterogeneous_Token_Object_Streams"></a><a name="_bb16">Heterogeneous Token Object Streams</a></h4> 
<p>
	While token creation is normally handled automatically, you can also manually specify the token object to be returned from a lexical rule. The advantage is that you can pass heterogeneous token objects back to the parser, which is extremely useful for parsing languages with complicated tokens such as HTML (the <font face="Courier New">&lt;img&gt;</font> and <font face="Courier New">&lt;table&gt;</font> tokens, for example, can have lots of attributes).&nbsp; Here is a rule for the &lt;img&gt; tag that returns a token object of type ImageToken:
</p>
<pre>IMAGE
{
  Attributes attrs;
}
  :  &quot;&lt;img &quot; attrs=ATTRIBUTES '&gt;'
     {
     ImageToken t = new ImageToken(IMAGE,$getText);
     t.setAttributes(attrs);
     $setToken(t);
     }
  ;
ATTRIBUTES returns [Attributes a]
  :  ...
  ;</pre> 
<p>
	The <font face="Courier New">$setToken</font> function specifies that its argument is to be returned when the rule exits.&nbsp; The parser will receive this specific object instead of a <font face="Courier New">CommonToken</font> or whatever else you may have specified with the <font face="Courier New">Lexer.setTokenObjectClass</font> method.&nbsp; The action in rule <font face="Courier New">IMAGE</font> references a token type, <font face="Courier New">IMAGE</font>, and a lexical rule reference, <font face="Courier New">ATTRIBUTES</font>, which matches all of the attributes of an image tag and returns them in a data structure called <font face="Courier New">Attributes</font>.
</p>
<p>
	What would it mean for rule <font face="Courier New">IMAGE</font> to be protected (i.e., referenced only from other lexical rules rather than from <font face="Courier New">nextToken</font>)? &nbsp; Any invoking labeled rule reference would receive the object (not the parser) and could examine it, or manipulate it, or pass it on to the invoker of that rule.&nbsp; For example, if <font face="Courier New">IMAGE</font> were called from <font face="Courier New">TAGS</font> rather than being nonprotected, rule <font face="Courier New">TAGS</font> would have to pass the token object back to the parser for it.
</p>
<pre>TAGS : img:IMAGE
       {$setToken(img);} // pass to parser
     | PARAGRAPH // probably has no special token
     | ...
     ;</pre> 
<p>
	Setting the token object for a nonprotected rule invoked without a label has no effect other than to waste time creating an object that will not be used.
</p>
<p>
	We use a <tt>CharScanner</tt> member <tt>_returnToken</tt> to do the return in order to not conflict with return values used by the grammar developer. For example, 
</p>
<pre>PTAG: &quot;&lt;p&gt;&quot; {$setToken(new ParagraphToken($$));} ; </pre> 
<p>
	which would be translated to something like: 
</p>
<pre>protected final void mPTAG()
  throws RecognitionException, CharStreamException,
         TokenStreamException {
    Token _token = null;
    match(&quot;&lt;p&gt;&quot;);
    _returnToken =
      new ParagraphToken(<em>text-of-current-rule</em>);
}</pre> <h3><a name="_bb17"></a><a name="Filtering Input Streams">Filtering Input Streams</a></h3> 
<p>
	You often want to perform an action upon seeing a pattern or two in a complicated input stream, such as pulling out links in an HTML file.&nbsp; One solution is to take the HTML grammar and just put actions where you want.&nbsp; Using a complete grammar is overkill and you may not have a complete grammar to start with.
</p>
<p>
	ANTLR provides a mechanism similar to AWK that lets you say &quot;here are the patterns I'm interested in--ignore everything else.&quot;&nbsp; Naturally, AWK is limited to regular expressions whereas ANTLR accepts context-free grammars (Uber-AWK?).&nbsp; For example, consider pulling out the &lt;p&gt; and &lt;br&gt; tags from an arbitrary HTML file.&nbsp; Using the filter option, this is easy:
</p>
<pre>class T extends Lexer;
options {
    k=2;
    filter=true;
}

P : &quot;&lt;p&gt;&quot; ;
BR: &quot;&lt;br&gt;&quot; ;</pre> 
<p>
	In this &quot;mode&quot;, there is no possibility of a syntax error.&nbsp; Either the pattern is matched exactly or it is filtered out.
</p>
<p>
	This works very well for many cases, but is not sophisticated enough to handle the situation where you want &quot;almost matches&quot; to be reported as errors. &nbsp; Consider the addition of the &lt;table...&gt; tag to the previous grammar:
</p>
<pre>class T extends Lexer; 
options { 
    k=2; 
    filter = true; 
} 

P : &quot;&lt;p&gt;&quot; ; 
BR: &quot;&lt;br&gt;&quot; ; 
TABLE : &quot;&lt;table&quot; (WS)? (ATTRIBUTE)* (WS)? '&gt;' ; 
WS : ' ' | '\t' | '\n' ; 
ATTRIBUTE : ... ;</pre> 
<p>
	Now, consider input &quot;&lt;table 8 = width ;&gt;&quot; (a bogus table definition). As is, the lexer would simply scarf past this input without &quot;noticing&quot; the invalid table. What if you want to indicate that a bad table definition was found as opposed to ignoring it?&nbsp; Call method
</p>
<pre>setCommitToPath(boolean commit)</pre> 
<p>
	in your TABLE rule to indicate that you want the lexer to commit to recognizing the table tag:
</p>
<pre>TABLE
    :   &quot;&lt;table&quot; (WS)?
        {setCommitToPath(true);}
        (ATTRIBUTE)* (WS)? '&gt;'
    ;</pre> 
<p>
	Input &quot;&lt;table 8 = width ;&gt;&quot; would result in a syntax error.&nbsp; Note the placement after the whitespace recognition; you do not want &lt;tabletop&gt; reported as a bad table (you want to ignore it).
</p>
<p>
	One further complication in filtering: What if the &quot;skip language&quot; (the stuff in between valid tokens or tokens of interest) cannot be correctly handled by simply consuming a character and trying again for a valid token?&nbsp; You may want to ignore comments or strings or whatever.&nbsp; In that case, you can specify a rule that scarfs anything between tokens of interest by using option <font face="Courier New">filter=<em>RULE</em></font>. &nbsp; For example, the grammar below filters for &lt;p&gt; and &lt;br&gt; tags as before, but also prints out any other tag (&lt;...&gt;) encountered.
</p>
<pre>class T extends Lexer;
options {
    k=2;
    filter=IGNORE;
    charVocabulary = '\3'..'\177';
}

P : &quot;&lt;p&gt;&quot; ;
BR: &quot;&lt;br&gt;&quot; ;

protected
IGNORE
    :   '&lt;' (~'&gt;')* '&gt;'
        {System.out.println(&quot;bad tag:&quot;+$getText);}
    |   ( &quot;\r\n&quot; | '\r' | '\n' ) {newline();}
    |   .
    ;</pre> 
<p>
	Notice that the filter rule must track newlines in the general case where the lexer might emit error messages so that the line number is not stuck at 0.
</p>
<p>
	The filter rule is invoked either when the lookahead (in nextToken) predicts none of the nonprotected lexical rules or when one of those rules fails.&nbsp; In the latter case, the input is rolled back before attempting the filter rule.&nbsp; Option <font face="Courier New">filter=true</font> is like having a filter rule such as:
</p>
<pre>IGNORE : . ;</pre> 
<p>
	Actions in regular lexical rules are executed even if the rule fails and the filter rule is called.&nbsp; To do otherwise would require every valid token to be matched twice (once to match and once to do the actions like a syntactic predicate)! Plus, there are few actions in lexer rules (usually they are at the end at which point an error cannot occur).
</p>
<p>
	Is the filter rule called when commit-to-path is true and an error is found in a lexer rule? No, an error is reported as with filter=true.
</p>
<p>
	What happens if there is a syntax error in the filter rule?&nbsp; Well, you can either put an exception handler on the filter rule or accept the default behavior, which is to consume a character and begin looking for another valid token.
</p>
<p>
	In summary, the filter option allows you to:
<ol>
	<li>
		Filter like awk (only perfect matches reported--no such thing as syntax error) 
	</li>
	<li>
		Filter like awk + catch poorly-formed matches (that is, &quot;almost matches&quot; like &lt;table 8=3;&gt; result in an error) 
	</li>
	<li>
		Filter but specify the skip language 
	</li>
</ol>
<h4><a id="ANTLR_Masquerading_as_SED" name="ANTLR_Masquerading_as_SED"></a><a name="ANTLR Masquerading as SED">ANTLR Masquerading as SED</a></h4> 
<p>
	To make ANTLR generate lexers that behave like the UNIX utility sed (copy standard in to standard out except as specified by the replace patterns), use a filter rule that does the input to output copying:
</p>
<pre>class T extends Lexer;
options {
  k=2;
  filter=IGNORE;
  charVocabulary = '\3'..'\177';
}

P  : &quot;&lt;p&gt;&quot; {System.out.print(&quot;&lt;P&gt;&quot;);};
BR : &quot;&lt;br&gt;&quot; {System.out.print(&quot;&lt;BR&gt;&quot;);};

protected
IGNORE
  :  ( &quot;\r\n&quot; | '\r' | '\n' )
     {newline(); System.out.println(&quot;&quot;);}
  |  c:. {System.out.print(c);}
  ;</pre>
<p>
	This example dumps anything other than &lt;p&gt; and &lt;br&gt; tags to standard out and pushes lowercase &lt;p&gt; and &lt;br&gt; to uppercase. Works great.
</p>
<h4><a id="Nongreedy_Subrules" name="Nongreedy_Subrules"></a><a name="Nongreedy Subrules">Nongreedy Subrules</a></h4> 
<p>
	Quick:&nbsp; What does the following match?
</p>
<pre>BLOCK : '{' (.)* '}';</pre> 
<p>
	Your first reaction is that it matches any set of characters inside of curly braces.&nbsp; In reality, it matches '{' followed by every single character left on the input stream!&nbsp; Why?&nbsp; Well, because ANTLR loops are <em>greedy</em>--they consume as much input as they can match.&nbsp; Since the wildcard matches any character, it consumes the '}' and beyond.&nbsp; This is a pain for matching strings, comments, and so on.
</p>
<p>
	Why can't we switch it around so that it consumes only until it sees something on the input stream that matches what <strong>follows</strong> the loop, such as the '}'? &nbsp; That is, why can't we make loops <em>nongreedy</em>?&nbsp; The answer is we can, but sometimes you want greedy and sometimes you want nongreedy (PERL has both kinds of closure loops now too).&nbsp; Unfortunately, parsers usually want greedy and lexers usually want nongreedy loops.&nbsp; Rather than make the same syntax behave differently in the various situations, Terence decided to leave the semantics of loops as they are (greedy) and make a subrule option to make loops nongreedy.
</p>
<h4><a id="Greedy_Subrules" name="Greedy_Subrules"></a><a name="Greedy Parser Subrules">Greedy Subrules</a></h4> 
<p>
	I have yet to see a case when building a parser grammar where I did not want a subrule to match as much input as possible.&nbsp; For example, the solution to the classic if-then-else clause ambiguity is to match the &quot;else&quot; as soon as possible:
</p>
<pre>stat : &quot;if&quot; expr &quot;then&quot; stat (&quot;else&quot; stat)?
     | ...
     ;</pre> 
<p>
	This ambiguity (which statement should the &quot;else&quot; be attached to) results in a parser nondeterminism.&nbsp; ANTLR warns you about the <font face="Courier New">(...)?</font> subrule as follows:
</p>
<pre>warning: line 3: nondeterminism upon
        k==1:&quot;else&quot;
        between alts 1 and 2 of block</pre> 
<p>
	If, on the other hand, you make it clear to ANTLR that you want the subrule to match greedily (i.e., assume the default behavior), ANTLR will not generate the warning. &nbsp; Use the <font face="Courier New">greedy</font> subrule option to tell ANTLR what you want:
</p>
<pre>stat : &quot;if&quot; expr &quot;then&quot; stat
       ( options {greedy=true;} : &quot;else&quot; stat)?
     | ID
     ;</pre> 
<p>
	You are not altering the behavior really, since ANTLR was going to choose to match the &quot;else&quot; anyway, but you have avoided a warning message.
</p>
<p>
	There is no such thing as a nongreedy <font face="Courier New">(...)?</font> subrule because telling an optional subrule not to match anything is the same as not specifying the subrule in the first place.&nbsp; If you make the subrule nongreedy, you will see:
</p>
<pre>warning in greedy.g: line(4),
        Being nongreedy only makes sense
        for (...)+ and (...)*
warning: line 4: nondeterminism upon
        k==1:&quot;else&quot;
        between alts 1 and 2 of block</pre> 
<p>
	Greedy subrules are very useful in the lexer also.&nbsp; If you want to grab any whitespace on the end of a token definition, you can try (WS)? for some whitespace rule WS:
</p>
<pre>ID : ('a'..'z')+ (WS)? ;</pre> 
<p>
	However, if you want to match ID in a loop in another rule that could also match whitespace, you will run into a nondeterminism warning.&nbsp; Here is a contrived loop that conflicts with the (WS)? in ID:
</p>
<pre>LOOP : (  ID
       |  WS
       )+
     ;</pre> 
<p>
	The whitespace on the end of the ID could now be matched in ID or in LOOP.&nbsp; ANTLR chooses to match the WS immediately, in ID.&nbsp; To shut off the warning, simply tell ANTLR that you mean for it to be greedy, its default behavior:
</p>
<pre>ID : ('a'..'z')+ (options {greedy=true;}:WS)? ;</pre>

<h4><a id="Nongreedy_Lexer_Subrules" name="Nongreedy_Lexer_Subrules"></a><a name="Nongreedy Lexer Subrules">Nongreedy Lexer Subrules</a></h4> 
<p>
	ANTLR's default behavior of matching as much as possible in loops and optional subrules is sometimes not what you want in lexer grammars.&nbsp; Most loops that match &quot;a bunch of characters&quot; in between markers, like curly braces or quotes, should be nongreedy loops.&nbsp; For example, to match a nonnested block of characters between curly braces, you want to say:
</p>
<pre>CURLY_BLOCK_SCARF
    :   '{' (.)* '}'
    ;</pre> 
<p>
	Unfortunately, this does not work--it will consume everything after the '{' until the end of the input.&nbsp; The wildcard matches anything including '}' and so the loop merrily consumes past the ending curly brace.
</p>
<p>
	To force ANTLR to break out of the loop when it sees a lookahead sequence consistent with what follows the loop, set the <font face="Courier New">greedy</font> subrule option to false:
</p>
<pre>CURLY_BLOCK_SCARF
    :   '{'
        (
            options {
                greedy=false;
            }
        :   .
        )*
        '}'
    ;</pre> 
<p>
	To properly take care of newlines inside the block, you should really use the following version that &quot;traps&quot; newlines and bumps up the line counter:
</p>
<pre>CURLY_BLOCK_SCARF
    :   '{'
        (
            options {
                greedy=false;
            }
        :   '\r' ('\n')? {newline();}
        |   '\n'         {newline();}
        |   .
        )*
        '}'
    ;</pre>

<h4><a id="Limitations_of_Nongreedy_Subrules" name="Limitations_of_Nongreedy_Subrules"></a><a name="Limitations of Greedy Subrules">Limitations of Nongreedy Subrules</a></h4> 
<p>
	What happens when what follows a nongreedy subrule is not as simple as a single &quot;marker&quot; character like a right curly brace (i.e., what about when you need k&gt;1 to break out of a loop)?&nbsp; ANTLR will either &quot;do the right thing&quot; or warn you that it might not.
</p>
<p>
	First, consider matching C comments:
</p>
<pre>CMT : &quot;/*&quot; (.)* &quot;*/&quot; ;</pre> 
<p>
	As with the curly brace matching, this rule will not stop at the end marker because the wildcard matches the &quot;*/&quot; end marker as well.&nbsp; You must tell ANTLR to make the loop nongreedy:
</p>
<pre>CMT : &quot;/*&quot; (options {greedy=false;} :.)* &quot;*/&quot; ;</pre> 
<p>
	You will not get an error, and ANTLR will generate an exit branch:
</p>
<pre>do {
    // nongreedy exit test
    if ((LA(1)=='*')) break _loop3;
    ...</pre> 
<p>
	Oops: k=1 is not enough lookahead.&nbsp; ANTLR did not generate a warning because it assumes you are providing enough lookahead for all nongreedy subrules.&nbsp; ANTLR cannot determine how much lookahead to use or how much is enough because, by definition, the decision is ambiguous--it simply generates a decision using the maximum lookahead. 
</p>
<p>
	You must provide enough lookahead to let ANTLR see the full end marker:
</p>
<pre>class L extends Lexer;
options {
        k=2;
}

CMT : &quot;/*&quot; (options {greedy=false;} :.)* &quot;*/&quot; ;</pre> 
<p>
	Now, ANTLR will generate an exit branch using k=2.
</p>
<pre>do {
    // nongreedy exit test
    if ((LA(1)=='*') &amp;&amp; (LA(2)=='/'))
        break _loop3;
    ...</pre> 
<p>
	If you increase k to 3, ANTLR will generate an exit branch using k=3 instead of 2, even though 2 is sufficient.&nbsp; We know that k=2 is enough, but ANTLR is faced with a nondeterministic decision, so it uses as much information as it has in an attempt to yield a deterministic parser.
</p>
<p>
	There is one more issue that you should be aware of.&nbsp; Because ANTLR generates linear approximate decisions instead of full LL(k) decisions, complicated &quot;end markers&quot; can confuse ANTLR.&nbsp; Fortunately, ANTLR knows when it is confused and will let you know.
</p>
<p>
	Consider a simple contrived example where a loop matches either ab or cd:
</p>
<pre>R : (   options {greedy=false;}
    :   (&quot;ab&quot;|&quot;cd&quot;)
    )+
    (&quot;ad&quot;|&quot;cb&quot;)
  ;</pre> 
<p>
	Following the loop, the grammar can match ad or cb.&nbsp; These exact sequences are not a problem for a full LL(k) decision, but due to the extreme compression of the linear approximate decision, ANTLR will generate an inaccurate exit branch.&nbsp; In other words, the loop will exit, for example, on ab even though that sequence cannot be matched following the loop.&nbsp;&nbsp; The exit condition is as follows:
</p>
<pre>// nongreedy exit test
if ( _cnt10&gt;=1 &amp;&amp; (LA(1)=='a'||LA(1)=='c') &amp;&amp;
     (LA(2)=='b'||LA(2)=='d')) break _loop10;</pre> 
<p>
	where the <font face="Courier New">_cnt10</font> term ensures the loop goes around at least once (but has nothing to do with the nongreedy exit branch condition really). &nbsp; Note that ANTLR has compressed all characters that can possibly be matched at a lookahead depth into a single set, thus, destroying the sequence information.&nbsp; The decision matches the cross product of the sets, including the spurious lookahead sequences such as ab.
</p>
<p>
	Fortunately, ANTLR knows when a decision falls between its approximate decision and a full LL(k) decision--it warns you as follows:
</p>
<pre><small>warning in greedy.g: line(3),
    nongreedy block may exit incorrectly due
    to limitations of linear approximate lookahead
    (first k-1 sets in lookahead not singleton).</small></pre> 
<p>
	The parenthetical remark gives you a hint that some k&gt;1 lookahead sequences are correctly predictable even with the linear approximate lookahead compression.&nbsp; The idea is that if all sets for depths 1..(k-1) are singleton sets (exactly one lookahead sequence for first k-1 characters) then linear approximate lookahead compression does not weaken your parser.&nbsp; So, the following variant does not yield a warning since the exit branch is linear approximate as well as full LL(k):
</p>
<pre>R : (   options {greedy=false;}
    :   .
    )+
    (&quot;ad&quot;|&quot;ae&quot;)
  ;</pre> 
<p>
	The exit branch decision now tests lookahead as follows:
</p>
<pre>   (LA(1)=='a') &amp;&amp; (LA(2)=='d'||LA(2)=='e')</pre> 
<p>
	which accurately predicts when to exit.
</p>

<h3><a id="Lexical_States" name="Lexical_States"></a><a name="LexicalStates">Lexical States</a></h3> 

<p>With DFA-based lexer generators such as <tt>lex</tt>, you often need
to match pieces of your input with separate sets of rules called
lexical states.  In ANTLR, you can simply define another rule and call
it like any other to switch "states".  Better yet, this "state" rule
can be reused by other parts of your lexer grammar because the method
return stack tells the lexer which rule to return to.  Unlike
recursive-descent parsers, DFAs have no stack and, hence, can only
switch back to one hard-coded rule.

<p> Consider an example where you would normally see a lexical
state--that of matching escape characters within a string.  You would
attach an action to the double quote character that switched state to
a <tt>STRING_STATE</tt> state.  This subordinate state would then
define rules for matching the various escapes and finally define a
rule for double quote whose action would switch you back to the
normal lexical state.  To demonstrate the solution with ANTLR, let's
start with just a simple string definition:

<tt><pre>
/** match anything between double-quotes */
STRING : '"' (~'"')* '"' ;
</pre></tt>

To allow escape characters like <tt>\t</tt>, you need to add an
alternative to the (...)* loop.  (You could do that with a DFA-based
lexer as well, but you could not have any actions associated with the
escape character alternatives to do a replacement etc...).  For
convenience, collect all escape sequences in another rule called <tt>ESC</tt>:

<tt><pre>
STRING : '"' (ESC | ~('\\'|'"'))* '"' ;

protected
ESC    : '\\' ('t' {...} | '"' {...} ) ;
</pre></tt>

The <tt>protected</tt> is a (poorly named) indicator that the rule,
<tt>ESC</tt>, is not a token to be returned to the parser.  It just
means that the <tt>nextToken</tt> method does not attempt to route
recognition flow directly to that rule--<tt>ESC</tt> must be called
from another lexer rule.

<p> This works for simple escapes, but does not include escapes like
<tt>\20</tt>.  To fix it, just add a reference to another rule
<tt>INT</tt> that you probably have already defined.

<tt><pre>
STRING : '"' (ESC | ~('\\'|'"'))* '"' ;

protected
ESC    : '\\' ('t' {...} | '"' {...} | INT {...}) ;

INT    : ('0'..'9')+ ;
</pre></tt>

Notice that <tt>INT</tt> is a real token that you want the parser to
see so the rule is not <tt>protected</tt>.  A rule may invoke any
other rule, <tt>protected</tt> or not.
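<p>To make the rule-as-method-call idea concrete, here is a hand-written sketch in the spirit of what ANTLR generates for <tt>STRING</tt> and <tt>ESC</tt> above.  The class and method names are ours for illustration only, not ANTLR's actual generated API:</p>

```java
// Illustrative hand-rolled analogue of the STRING/ESC rules above;
// names and structure are hypothetical, not ANTLR-generated code.
public class StringScanner {
    private final String input;
    private int p = 0;                      // current input position
    private final StringBuilder text = new StringBuilder();

    public StringScanner(String input) { this.input = input; }

    private char LA(int i) { return input.charAt(p + i - 1); }
    private void consume() { text.append(input.charAt(p++)); }

    /** STRING : '"' (ESC | ~('\\'|'"'))* '"' ;  (no EOF handling here) */
    public String mSTRING() {
        consume();                          // opening quote
        while (LA(1) != '"') {
            if (LA(1) == '\\') mESC();      // rule reference = method call
            else consume();
        }
        consume();                          // closing quote
        return text.toString();
    }

    /** ESC : '\\' ('t' | '"' | ...) ; */
    private void mESC() {
        consume();                          // the backslash
        consume();                          // the escaped character
    }
}
```

<p>Because <tt>mESC()</tt> is an ordinary method, any rule can call it and return correctly--exactly the stack behavior a DFA lacks.</p>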

<p> Lexical states with DFA-based lexers merely allow you to recognize
complicated tokens more easily--the parser has no idea the contortions
the lexer goes through.  There are some situations where you might
want multiple, completely-separate lexers to feed your
parser.  One such situation is where you have an embedded language
such as javadoc comments.  ANTLR has the ability to switch between
multiple lexers using a token stream multiplexor.&nbsp; Please see the
discussion in <a href="streams.html#lexerstates">streams</a>.</p>

<h3><a id="The_End_Of_File_Condition" name="The_End_Of_File_Condition">The End Of File Condition</a></h3> 
<p>
	<font size="3">A method is available for reacting to the end-of-file condition as if it were an event; e.g., you might want to pop the lexer state at the end of an include file.&nbsp; This method, <font face="Courier New">CharScanner.uponEOF()</font>, is called from <font face="Courier New">nextToken()</font> right before the scanner returns an <font face="Courier New">EOF_TYPE</font> token object to the parser:</font>
</p>
<pre><font size="3">public void uponEOF()
    throws TokenStreamException, CharStreamException;</font></pre> 
<p>
	<font size="3">This event is not generated during a syntactic predicate evaluation (i.e., when the parser is guessing) nor in the middle of the recognition of a lexical rule (that would be an IO exception).&nbsp; This event is generated only after the complete evaluation of the last token and upon the next request from the parser for a token.</font>
</p>
<p>
	<font size="3">You can throw exceptions from this method like &quot;Heh, premature eof&quot; or a retry stream exception.&nbsp; See includeFile/P.g for an example usage.</font>
</p>
<h3><a id="Case_sensitivity" name="Case_sensitivity"></a><a name="casesensitivity">Case sensitivity</a></h3> 
<p>
	You may use option <font face="Courier New">caseSensitive=false</font> in the lexer to indicate that you do not want case to be significant when matching characters against the input stream.  For example, you want element <font face="Courier New">'d'</font> to match either upper or lowercase D; however, you do not want to change the case of the input stream.  We have implemented this feature by having the lexer's <font face="Courier New">LA()</font> lookahead method return lowercase versions of the characters.  Method <font face="Courier New">consume()</font> still adds the original characters to the string buffer associated with a token.  We make the following notes:
<ul>
	<li>
		The lowercasing is done by a method <font face="Courier New">toLower()</font> in the lexer.  This can be overridden to get more specific case processing.&nbsp; Using option caseSensitive calls method <font face="Courier New">CharScanner.setCaseSensitive(...)</font>, which you can also call before (or during, I suppose) the parse.
	</li>
	<li>
		ANTLR issues a warning when <font size="2" face="Courier New">caseSensitive=false</font> and uppercase ASCII characters are used in character or string literals.
	</li>
</ul>
<p>
	Case sensitivity for literals is handled separately.  That is, set lexer option <font size="2" face="Courier New">caseSensitiveLiterals</font> to false when you want literals testing to be case-insensitive.  Implementing this required changes to the literals table: instead of adding a String, it adds an ANTLRHashString that implements case-insensitive or case-sensitive hashing as desired.
</p>
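<p>The effect of such a literals-table key can be sketched as follows.  This is our own illustrative reimplementation of the idea, not ANTLR's actual <font face="Courier New">ANTLRHashString</font> class:</p>

```java
// Hypothetical sketch of a case-configurable literals-table key,
// in the spirit of ANTLRHashString (not the real implementation).
public class CaseKey {
    private final String s;

    public CaseKey(String s, boolean caseSensitive) {
        // fold case once, up front, when insensitive matching is wanted
        this.s = caseSensitive ? s : s.toLowerCase();
    }

    @Override public int hashCode() { return s.hashCode(); }

    @Override public boolean equals(Object o) {
        return (o instanceof CaseKey) && s.equals(((CaseKey) o).s);
    }
}
```

<p>With case sensitivity off, &quot;BEGIN&quot; and &quot;begin&quot; hash and compare equal, so the literals lookup succeeds either way.</p>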
<p>
	Note: ANTLR checks the characters of a lexer string to make sure they are lowercase, but does not process escapes correctly--put that one on the &quot;to do&quot; list.
</p>
<h3><a id="_bb18" name="_bb18"></a><a name="ignoringwhitespace">Ignoring whitespace in the lexer</a></h3> 
<p>
	One of the great things about ANTLR is that it generates full predicated-LL(k) lexers rather than the weaker (albeit sometimes easier-to-specify) DFA-based lexers of DLG.  With such power, you are tempted (and encouraged) to do real parsing in the lexer.  A great example of this is HTML parsing, which begs for a two-level parse: the lexer parses all the attributes and so on within a tag, but the parser does overall document structure and ordering of the tags etc...  The problem with parsing within a lexer is that you encounter the usual &quot;ignore whitespace&quot; issue as you do with regular parsing.
</p>
<p>
	For example, consider matching the <font face="Courier New">&lt;table&gt;</font> tag of HTML, which has many attributes that can be specified within the tag. A first attempt might yield:
</p>
<pre>OTABLE   :	&quot;&lt;table&quot; (ATTR)* '&gt;'
         ;</pre> 
<p>
	Unfortunately, input &quot;<font face="Courier New">&lt;table border=1&gt;</font>&quot; does not parse because of the blank character after the <font face="Courier New">table</font> identifier. The solution is not to simply have the lexer ignore whitespace as it is read in because the lookahead computations must see the whitespace characters that will be found in the input stream. Further, defining whitespace as a rudimentary set of things to ignore does not handle all cases, particularly difficult ones, such as comments inside tags like
</p>
<pre>&lt;table &lt;!--wow...a comment--&gt; border=1&gt;</pre> 
<p>
	The correct solution is to specify a rule that is called after each lexical element (character, string literal, or lexical rule reference). We provide the lexer rule option <font face="Courier New">ignore</font> to let you specify the rule to use as whitespace. The solution to our HTML whitespace problem is therefore:
</p>
<pre>TABLE	
options { ignore=WS; }
       :	&quot;&lt;table&quot; (ATTR)* '&gt;'
       ;</pre> <pre>// can be protected or non-protected rule
WS     :	' ' | '\n' | COMMENT | ...
       ;</pre> 
<p>
	We think this is cool and we hope it encourages you to do more and more interesting things in the lexer!
</p>
<p>
	Oh, almost forgot. There is a <strong>bug</strong> in that an extra whitespace reference is inserted after the end of a lexer alternative if the last element is an action. The effect is to include any whitespace following that token in that token's text.
</p>
<h3><a id="_bb19" name="_bb19"></a><a name="trackingline">Tracking Line Information</a></h3> 
<p>
	Each lexer object has a <font size="2" face="Courier New">line</font> member that can be incremented by calling <font size="2" face="Courier New">newline()</font> or by simply changing its value (e.g., when processing <font size="2" face="Courier New">#line</font> directives in C).
</p>
<pre>SL_COMMENT : &quot;//&quot; (~'\n')* '\n' {newline();} ;</pre> 
<p>
	Do not forget to split out '<font face="Courier New">\n</font>' recognition when using the not operator to read until a stopping character, such as:
</p>
<pre>BLOCK: '('
           ( '\n' { newline(); }
           | ~( '\n' | ')' )
           )*
       ')'
     ;</pre> 
<p>
	Another way to track line information is to override the <font size="2" face="Courier New">consume()</font> method to watch for newline characters.
</p>
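<p>In a real lexer you would subclass and override <font face="Courier New">consume()</font> to check the lookahead before consuming; the sketch below models just that idea with a hypothetical stand-alone class (not ANTLR's CharScanner itself):</p>

```java
// Sketch: count lines inside consume() instead of in each rule.
// In an ANTLR lexer the override would call newline() and then
// super.consume(); here we inline the bookkeeping for illustration.
public class ConsumingCounter {
    private final String input;
    private int p = 0;
    private int line = 1;

    public ConsumingCounter(String input) { this.input = input; }

    public void consume() {
        if (input.charAt(p) == '\n') line++;  // what newline() would do
        p++;
    }

    public void consumeAll() { while (p < input.length()) consume(); }
    public int getLine() { return line; }
}
```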
<h3><a id="_bb20" name="_bb20"></a><a name="trackingcolumn">Tracking Column Information</a></h3> 
<p>
	ANTLR (2.7.1 and beyond) tracks character column information so that each token knows what column it starts in; columns start at 1 just like line numbers.&nbsp; The CharScanner.consume() method asks method tab() to update the column number if it sees a tab; otherwise it just increments the column number:
</p>
<pre>    ...
    if ( c=='\t' ) {
	tab();
    }
    else {
	inputState.column++;
    }</pre> 
<p>
	By default, tab() is defined as follows:
</p>
<pre><font face="Courier New">/** Advance the current column number by an appropriate
 *  amount.  If you do not override this to specify how
 *  much to jump for a tab, then tabs are counted as
 *  one char.  This method is called from consume().
 */
public void tab() {
  // update inputState.column as a function of
  // inputState.column and tab stops.
  // For example, if tab stops are columns 1
  // and 5 etc... and column is 3, then add 2
  // to column.
  inputState.column++;
}</font></pre> 
<p>
	Upon new line, the lexer needs to reset the column number to 1.&nbsp; Here is the default implementation of CharScanner.newline():
</p>
<pre>    public void newline() {
	inputState.line++;
	inputState.column = 1;
    }
</pre> 
<p>
	Do not forget to call newline() in your lexer rule that matches '\n' lest the column number not be reset to 1 at the start of a line.
</p>
<p>
	The shared input state object for a lexer is actually the critter that tracks the column number (as well as the starting column of the current token):
</p>
<pre>public class LexerSharedInputState {
    protected int column=1;
    protected int line=1;
    protected int tokenStartColumn = 1;
    protected int tokenStartLine = 1;
    ...
}</pre> 
<p>
	If you want to handle tabs in your lexer, just implement a method like the following to override the standard behavior. 
</p>
<pre>/** set tabs to 4, just round column up to next tab + 1
12345678901234567890
    x   x   x   x
 */
public void tab() {
	int t = 4;
	int c = getColumn();
	int nc = (((c-1)/t)+1)*t+1;
	setColumn( nc );
}</pre> 
<p>
	See the <font face="Courier New">examples/java/columns</font> directory for the complete example.
</p>
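<p>The arithmetic in that override is easy to check in isolation: with tab stops every four columns, the stops land on columns 5, 9, 13, and so on.  A minimal stand-alone version of the formula:</p>

```java
// Stand-alone version of the tab-stop arithmetic used in tab() above.
public class TabStops {
    /** Round column c up to one past the next multiple-of-t tab stop. */
    public static int next(int c, int t) {
        return (((c - 1) / t) + 1) * t + 1;
    }
}
```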
<h3><a id="_bb21" name="_bb21"></a><a name="usingexplicit">Using Explicit Lookahead</a></h3> 
<p>
	On rare occasions, you may find it useful to explicitly test the lexer lookahead in, say, a semantic predicate to help direct the parse.  For example, /*...*/ comments have a two-character stopping symbol.  The following example demonstrates how to use the second symbol of lookahead to distinguish between a single '*' and a &quot;*/&quot;:
</p>
<pre>ML_COMMENT
    :    &quot;/*&quot;
         (  { LA(2)!='/' }? '*'
         | '\n' { newline(); }
         | ~('*'|'\n')
         )*
         &quot;*/&quot;
    ;</pre> 
<p>
	The same effect might be possible via a syntactic predicate, but it would be much slower than a semantic predicate.&nbsp; DFA-based lexers handle this with no problem because they use a bunch of (what amount to) gotos, whereas we're stuck with structured elements like while-loops.
</p>
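<p>That semantic predicate compiles down to an ordinary lookahead comparison.  A hand-written equivalent of the ML_COMMENT loop (our own sketch, not ANTLR's generated code) makes the two-character test explicit:</p>

```java
// Hand-written analogue of ML_COMMENT: scan "/*" ... "*/" using two
// characters of lookahead; a '*' with LA(2)!='/' is ordinary content.
public class CommentSkipper {
    /** Returns the index just past the comment starting at 'start',
     *  or -1 if no complete comment begins there. */
    public static int skip(String s, int start) {
        int p = start;
        if (!s.startsWith("/*", p)) return -1;
        p += 2;
        while (p + 1 < s.length()
               && !(s.charAt(p) == '*' && s.charAt(p + 1) == '/')) {
            p++;                    // newline handling omitted for brevity
        }
        return (p + 1 < s.length()) ? p + 2 : -1;   // -1: unterminated
    }
}
```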
<h3><a id="_bb22" name="_bb22"></a><a name="surprisinguse">A Surprising Use of A Lexer: Parsing</a></h3> 
<p>
	The following set of rules match arithmetical expressions in a lexer <strong>not</strong> a parser (whitespace between elements is not allowed in this example but can easily be handled by specifying rule option <font face="Courier New">ignore</font> for each rule):
</p>
<pre>EXPR
{ int val; }
    :   val=ADDEXPR
        { System.out.println(val); }
    ;

protected
ADDEXPR returns [int val]
{ int tmp; }
    :   val=MULTEXPR
        ( '+' tmp=MULTEXPR { val += tmp; }
        | '-' tmp=MULTEXPR { val -= tmp; }
        )*
    ;

protected
MULTEXPR returns [int val]
{ int tmp; }
    :   val=ATOM
        (   '*' tmp=ATOM { val *= tmp; }
        |   '/' tmp=ATOM { val /= tmp; }
        )*
    ;

protected 
ATOM returns [int val]
    :   val=INT
    |   '(' val=ADDEXPR ')'
    ;

protected
INT returns [int val]
    :   ('0'..'9')+
        {val=Integer.parseInt($getText);}
    ;</pre> <h3><a id="_bb23" name="_bb23"></a><a name="dfacompare">But...We've Always Used Automata For Lexical Analysis!</a></h3> 
<p>
	Lexical analyzers were all built by hand in the early days of compilers until DFAs took over as the scanner implementation of choice. DFAs have several advantages over hand-built scanners: 
<ul>
	<li>
		DFAs can easily be built from terse regular expressions. 
	</li>
	<li>
		DFAs do automatic left-factoring of common (possibly infinite) left-prefixes. In a hand-built scanner, you have to find and factor out all common prefixes. For example, consider writing a lexer to match integers and floats. The regular expressions are straightforward: <pre><tt>integer : &quot;[0-9]+&quot; ;
real    : &quot;[0-9]+{.[0-9]*}|.[0-9]+&quot; ;</tt>    </pre> 
		<p>
			Building a scanner for this would require factoring out the common <tt>[0-9]+</tt>. For example, a scanner might look like: 
		</p>
<pre><tt>Token nextToken() {
  if ( Character.isDigit(c) ) {
    <i>match an integer</i>
    if ( c=='.' ) {
      <i>match another integer</i>
      return new Token(REAL);
    }
    else {
      return new Token(INT);
    }
  }
  else if ( c=='.' ) {
    <i>match a float starting with .</i>
    return new Token(REAL);
  }
  else ...
}</tt>  </pre> 
	</li>
</ul>
<p>
	Conversely, hand-built scanners have the following advantages over DFA implementations: 
<ul>
	<li>
		Hand-built scanners are not limited to the regular class of languages. They may use semantic information and method calls during recognition whereas a DFA has no stack and is typically not semantically predicated. 
	</li>
	<li>
		Unicode (16 bit values) is handled for free whereas DFAs typically have fits about anything but 8 bit characters. 
	</li>
	<li>
		DFAs are tables of integers and are, consequently, very hard to debug and examine. 
	</li>
	<li>
		A tuned hand-built scanner can be faster than a DFA. For example, simulating the DFA to match <tt>[0-9]+</tt> requires <i>n</i> DFA state transitions where <i>n</i> is the length of the integer in characters. 
		<p>
			Tom Pennello of Metaware back in 1986 (&quot;Very Fast LR Parsing&quot;) generated LR-based parsers in machine code that used the program counter to do state transitions rather than simulating the PDA. He got a huge speed up in parse time. We can extrapolate from this experiment that avoiding a state machine simulator in favor of raw code results in a speed up. 
		</p>
	</li>
</ul>
<p>
	So, what approach does ANTLR take? Neither! ANTLR allows you to specify lexical items with expressions, but generates a lexer for you that mimics what you would generate by hand. The only drawback is that you still have to do the left-factoring for some token definitions (but at least it is done with expressions and not code). This hybrid approach allows you to build lexers that are much stronger and faster than DFA-based lexers while avoiding much of the overhead of writing the lexer yourself.
</p>
<p>
	In summary, specifying regular expressions is simpler and shorter than writing a hand-built lexer, but hand-built lexers are faster, stronger, able to handle Unicode, and easier to debug.  This analysis has led many programmers to write hand-built lexers even when DFA-generation tools such as <tt>lex</tt> and <tt>dlg</tt> are commonly available.  PCCTS 1.xx made a parallel argument concerning PDA-based LR parsers and recursive-descent LL-based parsers.  As a final justification, we note that writing lexers is trivial compared to building parsers; also, once you build a lexer you will reuse it with small modifications in the future.
</p>
<pre><font face="Arial" size="2">Version: $Id: //depot/code/org.antlr/release/antlr-2.7.5/doc/lexer.html#1 $</font></pre> 
</body>
</html>