- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
- <html xmlns="http://www.w3.org/1999/xhtml">
- <!--
- Copyright (c) 2006, Sun Microsystems, Inc.
- All rights reserved.
- Redistribution and use in source and binary forms, with or without
- modification, are permitted provided that the following conditions are met:
- * Redistributions of source code must retain the above copyright notice,
- this list of conditions and the following disclaimer.
- * Redistributions in binary form must reproduce the above copyright
- notice, this list of conditions and the following disclaimer in the
- documentation and/or other materials provided with the distribution.
- * Neither the name of the Sun Microsystems, Inc. nor the names of its
- contributors may be used to endorse or promote products derived from
- this software without specific prior written permission.
- THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
- AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
- LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
- INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
- CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
- ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
- THE POSSIBILITY OF SUCH DAMAGE.
- -->
- <head>
- <title>JavaCC README for SimpleExamples</title>
- <!-- Changed by: Michael Van De Vanter, 14-Jan-2003 -->
- </head>
- <body bgcolor="#FFFFFF" >
- <h1>JavaCC [tm]: README for SimpleExamples</h1>
- <p>
- This directory contains six examples to get you started using JavaCC.
- Each example is contained in a single grammar file. The files are listed
- below:
- </p>
- <pre>
- Simple1.jj
- Simple2.jj
- Simple3.jj
- Simple4.jj
- NL_Xlator.jj
- IdList.jj
- </pre>
- <p>
- Once you have tried out and understood each of these examples, you
- should take a look at more complex examples in other sub-directories
- under the examples directory. But even with just these examples, you
- should be able to get started on reasonably complex grammars.
- </p>
- <h2>Summary Instructions</h2>
- <p>
- If you are a parser and lexical analyzer expert and can understand the
- examples by just reading them, the following instructions show you how
- to get started with JavaCC. The instructions below are with respect
- to Simple1.jj, but you can build any parser using the same set of
- commands.
- </p>
- <ol>
- <li>
- Run javacc on the grammar input file to generate a bunch of Java
- files that implement the parser and lexical analyzer (or token
- manager):
- <pre>
- javacc Simple1.jj
- </pre>
- </li>
- <li> Now compile the resulting Java programs:
- <pre>
- javac *.java
- </pre>
- </li>
- <li> The parser is now ready to use. To run the parser, type:
- <pre>
- java Simple1
- </pre>
- </li>
- </ol>
- <p>
- The Simple1 parser and others in this directory are designed to take
- input from standard input. Simple1 recognizes matching braces
- followed by zero or more line terminators and then an end of file.
- </p>
- <p>
- Examples of legal strings in this grammar are:
- </p>
- <pre>
- "{}", "{{{{{}}}}}", etc.
- </pre>
- <p>
- Examples of illegal strings are:
- </p>
- <pre>
- "{{{{", "{}{}", "{}}", "{{}{}}", "{ }", "{x}", etc.
- </pre>
- <p>
- Try typing various inputs to Simple1. Remember that &lt;control-d&gt;
- indicates the end of file on UNIX platforms.
- Here are some sample runs:
- </p>
- <pre>
- % java Simple1
- {{}}&lt;return&gt;
- &lt;control-d&gt;
- %
- % java Simple1
- {x&lt;return&gt;
- Lexical error at line 1, column 2. Encountered: "x"
- TokenMgrError: Lexical error at line 1, column 2. Encountered: "x" (120), after : ""
- at Simple1TokenManager.getNextToken(Simple1TokenManager.java:146)
- at Simple1.getToken(Simple1.java:140)
- at Simple1.MatchedBraces(Simple1.java:51)
- at Simple1.Input(Simple1.java:10)
- at Simple1.main(Simple1.java:6)
- %
- % java Simple1
- {}}&lt;return&gt;
- ParseException: Encountered "}" at line 1, column 3.
- Was expecting one of:
- &lt;EOF&gt;
- "\n" ...
- "\r" ...
- at Simple1.generateParseException(Simple1.java:184)
- at Simple1.jj_consume_token(Simple1.java:126)
- at Simple1.Input(Simple1.java:32)
- at Simple1.main(Simple1.java:6)
- %
- </pre>
- <h3>DETAILED DESCRIPTION OF Simple1.jj</h3>
- <p>
- This is a simple JavaCC grammar that recognizes a sequence of left braces
- followed by the same number of right braces, then zero or more line
- terminators, and finally an end of file. Examples of
- legal strings in this grammar are:
- </p>
- <pre>
- "{}", "{{{{{}}}}}", etc.
- </pre>
- <p>
- Examples of illegal strings are:
- </p>
- <pre>
- "{{{{", "{}{}", "{}}", "{{}{}}", etc.
- </pre>
- <p>
- This grammar file starts with settings for all the options offered by
- JavaCC. In this case the option settings are their default values.
- Hence these option settings were really not necessary. One could as
- well have completely omitted the options section, or omitted one or
- more of the individual option settings. The details of the individual
- options are described in the JavaCC documentation on the web pages.
- </p>
- <p>
- Following this is a Java compilation unit enclosed between
- "PARSER_BEGIN(name)" and "PARSER_END(name)". This compilation unit
- can be of arbitrary complexity. The only constraint on this
- compilation unit is that it must define a class called "name" - the
- same as the arguments to PARSER_BEGIN and PARSER_END. This is the
- name that is used as the prefix for the Java files generated by the
- parser generator. The parser code that is generated is inserted
- immediately before the closing brace of the class called "name".
- </p>
- <p>
- In the above example, the class in which the parser is generated
- contains a main program. This main program creates an instance of the
- parser object (an object of type Simple1) by using a constructor that
- takes one argument of type java.io.InputStream ("System.in" in this
- case).
- </p>
- <p>
- The main program then makes a call to the non-terminal in the grammar
- that it would like to parse - "Input" in this case. All non-terminals
- have equal status in a JavaCC generated parser, and hence one may
- parse with respect to any grammar non-terminal.
- </p>
- <p>
- Following this is a list of productions. In this example, there are
- two productions that define the non-terminals "Input" and
- "MatchedBraces" respectively. In JavaCC grammars, non-terminals are
- written and implemented (by JavaCC) as Java methods. When the
- non-terminal is used on the left-hand side of a production, it is
- considered to be declared and its syntax follows the Java syntax. On
- the right-hand side, its use is similar to a method call in Java.
- </p>
- <p>
- Each production defines its left-hand side non-terminal followed by a
- colon. This is followed by a bunch of declarations and statements
- within braces (in both cases in the above example, there are no
- declarations and hence this appears as "{}") which are generated as
- common declarations and statements into the generated method. This is
- then followed by a set of expansions also enclosed within braces.
- </p>
- <p>
- Lexical tokens (regular expressions) in a JavaCC input grammar are
- either simple strings ("{", "}", "\n", and "\r" in the above example),
- or more complex regular expressions. In our example above, there is
- one such regular expression, "&lt;EOF&gt;", which is matched by the end of
- file. All complex regular expressions are enclosed within angle
- brackets.
- </p>
- <p>
- The first production above says that the non-terminal "Input" expands
- to the non-terminal "MatchedBraces" followed by zero or more line
- terminators ("\n" or "\r") and then the end of file.
- </p>
- <p>
- The second production above says that the non-terminal "MatchedBraces"
- expands to the token "{" followed by an optional nested expansion of
- "MatchedBraces" followed by the token "}". Square brackets [...]
- in a JavaCC input file indicate that the ... is optional.
- </p>
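<p>
The two productions described above can be sketched as a small hand-written
recursive-descent recognizer in plain Java. This is only an illustration of
how each non-terminal maps to a method; it is not the code that JavaCC
generates, and the class name Simple1Sketch and its helper methods are
hypothetical.
</p>

```java
/**
 * Hand-written sketch of the Simple1 grammar (NOT the code JavaCC generates).
 * Each non-terminal becomes one method, mirroring the two productions.
 * The class and method names here are illustrative assumptions.
 */
final class Simple1Sketch {
    private final String input;
    private int pos = 0;

    private Simple1Sketch(String input) {
        this.input = input;
    }

    /** Input ::= MatchedBraces ( "\n" | "\r" )* EOF */
    private boolean input() {
        if (!matchedBraces()) {
            return false;
        }
        while (peek() == '\n' || peek() == '\r') {
            pos++;
        }
        return pos == input.length(); // end of file
    }

    /** MatchedBraces ::= "{" [ MatchedBraces ] "}" */
    private boolean matchedBraces() {
        if (peek() != '{') {
            return false;
        }
        pos++;
        if (peek() == '{') { // the optional [ MatchedBraces ] part
            if (!matchedBraces()) {
                return false;
            }
        }
        if (peek() != '}') {
            return false;
        }
        pos++;
        return true;
    }

    /** Returns the current character, or '\0' at end of input. */
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    /** Returns true if the whole text is a legal Simple1 input. */
    static boolean accepts(String text) {
        return new Simple1Sketch(text).input();
    }

    public static void main(String[] args) {
        System.out.println(accepts("{{}}\n"));
        System.out.println(accepts("{}{}"));
    }
}
```

<p>
Unlike the generated parser, this sketch reports failure with a boolean
instead of throwing ParseException or TokenMgrError.
</p>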
- <p>
- [...] may also be written as (...)?. These two forms are equivalent.
- Other structures that may appear in expansions are:
- </p>
- <pre>
- e1 | e2 | e3 | ... : A choice of e1, e2, e3, etc.
- ( e )+ : One or more occurrences of e
- ( e )* : Zero or more occurrences of e
- </pre>
- <p>
- Note that these may be nested within each other, so we can have
- something like:
- </p>
- <pre>
- (( e1 | e2 )* [ e3 ] ) | e4
- </pre>
- <p>
- To build this parser, simply run JavaCC on this file and compile the
- resulting Java files:
- </p>
- <pre>
- javacc Simple1.jj
- javac *.java
- </pre>
- <p>
- Now you should be able to run the generated parser. Make sure that
- the current directory is in your CLASSPATH and type:
- </p>
- <pre>
- java Simple1
- </pre>
- <p>
- Now type a sequence of matching braces followed by a return and an end
- of file (CTRL-D on UNIX machines). If this is a problem on your
- machine, you can create a file and redirect it as input to the generated
- parser. Redirection does not work on all machines either; if this is a
- problem, replace "System.in" in the grammar file with
- 'new FileInputStream("testfile")' and place your input inside this
- file. To redirect a file named myfile, type:
- </p>
- <pre>
- java Simple1 &lt; myfile
- </pre>
- <p>
- Also try entering illegal sequences such as mismatched braces, spaces,
- and carriage returns between braces as well as other characters and
- take a look at the error messages produced by the parser.
- </p>
- <h3>DETAILED DESCRIPTION OF Simple2.jj</h3>
- <p>
- Simple2.jj is a minor modification to Simple1.jj to allow white space
- characters to be interspersed among the braces. Input such
- as:
- </p>
- <pre>
- "{{ }\n}\n\n"
- </pre>
- <p>
- will now be legal.
- </p>
- <p>
- Take a look at Simple2.jj. The first thing you will note is that we
- have omitted the options section. This does not change anything since
- the options in Simple1.jj were all assigned their default values.
- </p>
- <p>
- The other difference between this file and Simple1.jj is that this
- file contains a lexical specification - the region that starts with
- "SKIP". Within this region are 4 regular expressions - space, tab,
- newline, and return. This says that matches of these regular
- expressions are to be ignored (and not considered for parsing). Hence
- whenever any of these 4 characters is encountered, it is just
- thrown away.
- </p>
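<p>
The effect of the SKIP region can be sketched in plain Java as a filter
that discards the four skipped characters before the braces are checked.
This is a hand-written illustration, not generated code. It assumes that
the SKIP region contains exactly space, tab, newline, and return, and that
the input must otherwise consist of one group of matched braces. The class
name SkipSketch is hypothetical.
</p>

```java
/**
 * Sketch of SKIP handling (an illustration, not JavaCC output).
 * Assumption: the skipped characters are space, tab, newline, and return.
 */
final class SkipSketch {
    private static boolean isSkipped(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    /**
     * Returns the brace characters that remain once skipped characters
     * are thrown away. Any other character is a lexical error.
     */
    static String tokens(String input) {
        StringBuilder seen = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isSkipped(c)) {
                continue; // ignored, never reaches the parser
            }
            if (c != '{' && c != '}') {
                throw new IllegalArgumentException("Lexical error at offset " + i);
            }
            seen.append(c);
        }
        return seen.toString();
    }

    /** True if the remaining braces are n left braces then n right braces, n >= 1. */
    static boolean accepts(String input) {
        String braces = tokens(input);
        int half = braces.length() / 2;
        if (half == 0 || braces.length() % 2 != 0) {
            return false;
        }
        for (int i = 0; i < half; i++) {
            if (braces.charAt(i) != '{' || braces.charAt(half + i) != '}') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accepts("{{ }\n}\n\n"));
    }
}
```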
- <p>
- In addition to SKIP, JavaCC has three other lexical specification
- regions. These are:
- </p>
- <pre>
- . TOKEN: This is used to specify lexical tokens (see next example)
- . SPECIAL_TOKEN: This is used to specify lexical tokens that are to be
- ignored during parsing. In this sense, SPECIAL_TOKEN is
- the same as SKIP. However, these tokens can be recovered
- within parser actions to be handled appropriately.
- . MORE: This specifies a partial token. A complete token is
- made up of a sequence of MORE's followed by a TOKEN
- or SPECIAL_TOKEN.
- </pre>
- <p>
- Please take a look at some of the more complex grammars such as the
- Java grammars for examples of usage of these lexical specification
- regions.
- </p>
- <p>
- You may build Simple2 and invoke the generated parser with input from
- the keyboard as standard input.
- </p>
- <p>
- You can also try generating the parser with the various debug options
- turned on and see what the output looks like. To do this, type:
- </p>
- <pre>
- javacc -debug_parser Simple2.jj
- javac Simple2*.java
- java Simple2
- </pre>
- <p>
- Then type:
- </p>
- <pre>
- javacc -debug_token_manager Simple2.jj
- javac Simple2*.java
- java Simple2
- </pre>
- <p>
- Note that token manager debugging produces a lot of diagnostic
- information and it is typically used to look at debug traces a single
- token at a time.
- </p>
- <h3>DETAILED DESCRIPTION OF Simple3.jj</h3>
- <p>
- Simple3.jj is the third and final version of our matching brace
- detector. This example illustrates the use of the TOKEN region for
- specifying lexical tokens. In this case, "{" and "}" are defined as
- tokens and given names LBRACE and RBRACE respectively. These labels
- can then be used within angle brackets (as in the example) to refer
- to these tokens. Typically such token specifications are used for
- complex tokens such as identifiers and literals. Tokens that are
- simple strings are left as is (as in the previous examples).
- </p>
- <p>
- This example also illustrates the use of actions in the grammar
- productions. The actions inserted in this example count the number of
- matching braces. Note the use of the declaration region to declare
- variables "count" and "nested_count". Also note how the non-terminal
- "MatchedBraces" returns its value as a function return value.
- </p>
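<p>
The idea of a non-terminal returning a value can be sketched in plain Java
as a recursive method that returns the nesting depth. This is a
hand-written illustration that assumes the input contains only brace
characters (white space already skipped); the class name BraceCounter and
its methods are hypothetical and do not reproduce the actions in
Simple3.jj.
</p>

```java
/**
 * Sketch of a production that returns a value (not JavaCC output).
 * Assumption: the input contains only '{' and '}' characters.
 */
final class BraceCounter {
    private final String input;
    private int pos = 0;

    private BraceCounter(String input) {
        this.input = input;
    }

    /** "{" [ MatchedBraces ] "}", returning the number of nested pairs. */
    private int matchedBraces() {
        int nestedCount = 0; // plays the role of the declaration region
        expect('{');
        if (pos != input.length() && input.charAt(pos) == '{') {
            nestedCount = matchedBraces(); // value of the nested non-terminal
        }
        expect('}');
        return nestedCount + 1; // count this pair as well
    }

    private void expect(char wanted) {
        if (pos == input.length() || input.charAt(pos) != wanted) {
            throw new IllegalArgumentException(
                "Expected '" + wanted + "' at offset " + pos);
        }
        pos++;
    }

    /** Returns the nesting depth, or throws if the braces do not match. */
    static int count(String braces) {
        BraceCounter counter = new BraceCounter(braces);
        int levels = counter.matchedBraces();
        if (counter.pos != braces.length()) {
            throw new IllegalArgumentException(
                "Unexpected input at offset " + counter.pos);
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(count("{{{}}}"));
    }
}
```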
- <h3>DETAILED DESCRIPTION OF NL_Xlator.jj</h3>
- <p>
- This example goes into the details of writing regular expressions in
- JavaCC grammar files. It also illustrates a slightly more complex set
- of actions that translate the expressions described by the grammar
- into English.
- </p>
- <p>
- The new concept in the above example is the use of more complex
- regular expressions. The regular expression:
- </p>
- <pre>
- &lt; ID: ["a"-"z","A"-"Z","_"] ( ["a"-"z","A"-"Z","_","0"-"9"] )* &gt;
- </pre>
- <p>
- creates a new regular expression whose name is ID. This can be
- referred to anywhere else in the grammar simply as &lt;ID&gt;. What follows in
- square brackets is a set of allowable characters - in this case it is
- any of the lower or upper case letters or the underscore. This is
- followed by 0 or more occurrences of any of the lower or upper case
- letters, digits, or the underscore.
- </p>
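<p>
For readers more familiar with java.util.regex, the ID expression
described above corresponds roughly to the pattern in the sketch below.
JavaCC regular expressions use their own syntax and matching rules, so
this is only an analogy. The class name IdPattern is hypothetical.
</p>

```java
import java.util.regex.Pattern;

/**
 * java.util.regex analogy for the JavaCC ID expression (an illustration only).
 * First character: a letter or underscore; then zero or more letters,
 * digits, or underscores.
 */
final class IdPattern {
    static final Pattern ID = Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*");

    /** Returns true if the whole text is a single identifier. */
    static boolean isId(String text) {
        return ID.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isId("_tmp1"));
        System.out.println(isId("9lives"));
    }
}
```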
- <p>
- Other constructs that may appear in regular expressions are:
- </p>
- <pre>
- ( ... )+ : One or more occurrences of ...
- ( ... )? : An optional occurrence of ... (Note that in the case
- of lexical tokens, (...)? and [...] are not equivalent)
- ( r1 | r2 | ... ) : Any one of r1, r2, ...
- </pre>
- <p>
- A construct of the form [...] is a pattern that is matched by the
- characters specified in ... . These characters can be individual
- characters or character ranges. A "~" before this construct is a
- pattern that matches any character not specified in ... . Therefore:
- </p>
- <pre>
- ["a"-"z"] matches all lower case letters
- ~[] matches any character
- ~["\n","\r"] matches any character except the new line characters
- </pre>
- <p>
- When a regular expression is used in an expansion, it takes a value of
- type "Token". The Token class is generated into the same directory as
- the parser, as "Token.java". In the above example, we have defined a variable of
- type "Token" and assigned the value of the regular expression to it.
- </p>
- <h3>DETAILED DESCRIPTION OF IdList.jj</h3>
- <p>
- This example illustrates an important attribute of the SKIP
- specification. The main point to note is that the regular expressions
- in the SKIP specification are only ignored *between tokens* and not
- *within tokens*. This grammar accepts any sequence of identifiers
- with white space in between.
- </p>
- <p>
- A legal input for this grammar is:
- </p>
- <pre>
- "abc xyz123 A B C \t\n aaa"
- </pre>
- <p>
- This is because any number of the SKIP regular expressions are allowed
- in between consecutive &lt;Id&gt;'s. However, the following is not a legal
- input:
- </p>
- <pre>
- "xyz 123"
- </pre>
- <p>
- This is because the space character after "xyz" is in the SKIP
- category and therefore causes one token to end and another to begin.
- This requires "123" to be a separate token and hence does not match
- the grammar.
- </p>
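<p>
The "between tokens, not within tokens" behavior can be sketched with a
small hand-written tokenizer in plain Java. It assumes that an identifier
is a letter followed by letters or digits and that the SKIP characters are
space, tab, newline, and return. The class name IdListSketch is
hypothetical, and this is not the token manager that JavaCC generates.
</p>

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of SKIP applying only between tokens (not JavaCC output).
 * Assumptions: Id is a letter followed by letters or digits, and the
 * skipped characters are space, tab, newline, and return.
 */
final class IdListSketch {
    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Splits the input into identifiers, or throws on a lexical error. */
    static List<String> tokenize(String input) {
        List<String> ids = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                pos++; // skipped, but only here, between tokens
                continue;
            }
            if (!isLetter(c)) {
                throw new IllegalArgumentException(
                    "Lexical error at offset " + pos + ": '" + c + "'");
            }
            int start = pos;
            // A skipped character ends the token: it is never part of an Id.
            while (pos < input.length()
                    && (isLetter(input.charAt(pos)) || isDigit(input.charAt(pos)))) {
                pos++;
            }
            ids.add(input.substring(start, pos));
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abc xyz123 A B C \t\n aaa"));
    }
}
```

<p>
In this sketch, "xyz 123" fails because the space ends the token "xyz"
and "123" cannot start a new identifier.
</p>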
- <p>
- If spaces were OK within &lt;Id&gt;'s, then all one has to do is to replace
- the definition of Id with:
- </p>
- <pre>
- TOKEN :
- {
- &lt; Id: ["a"-"z","A"-"Z"] ( (" ")* ["a"-"z","A"-"Z","0"-"9"] )* &gt;
- }
- </pre>
- <p>
- Note that having a space character within a TOKEN specification does
- not mean that the space character cannot be used in the SKIP
- specification. All this means is that any space character that
- appears in the context where it can be placed within an identifier
- will participate in the match for &lt;Id&gt;, whereas all other space
- characters will be ignored. The details of the matching algorithm are
- described in the JavaCC documentation.
- </p>
- <p>
- As a corollary, one must define as tokens anything within which
- characters such as white space characters must not be present. In the
- above example, if &lt;Id&gt; were defined as a grammar production rather than
- a lexical token, as shown below this paragraph, then "xyz 123" would
- have been recognized as a legitimate &lt;Id&gt; (wrongly).
- </p>
- <pre>
- void Id() :
- {}
- {
- &lt;["a"-"z","A"-"Z"]&gt; ( &lt;["a"-"z","A"-"Z","0"-"9"]&gt; )*
- }
- </pre>
- <p>
- Note that in the above definition of non-terminal Id, it is made up of
- a sequence of single character tokens (note the location of &lt;...&gt;s),
- and hence white space is allowed between these characters.
- </p>
- </body>
- </html>