/lib/antlr-2.7.5/doc/glossary.html
HTML | 640 lines | 499 code | 139 blank | 2 comment | 0 complexity | b1340df74c4c5311b3f0ca52ade8bea8 MD5 | raw file
Possible License(s): GPL-2.0
- <html>
- <head>
- <title>ANTLR-centric Language Glossary</title>
- </head>
- <body bgcolor="#FFFFFF" text="#000000">
- <font face=Arial>
- <h1><a id="ANTLR-centric_Language_Glossary" name="ANTLR-centric_Language_Glossary">ANTLR-centric Language Glossary</a></h1>
- <i>Terence Parr</i>
- <br><br>
- This glossary defines some of the more important terms used in the
- ANTLR documentation. I have tried to be very informal and provide
- references to other pages that are useful. For another great source
- of information about formal computer languages, see <a
- href="http://www.wikipedia.org/wiki/Context-free_grammar"><b>Wikipedia</b></a>.
- <font size=2>
- <!--index-->
- <ul>
- <li><a href="#Ambiguous">Ambiguous</a></li>
- <li><a href="#ANTLR">ANTLR</a></li>
- <li><a href="#AST">AST</a></li>
- <li><a href="#Bit_set">Bit set</a></li>
- <li><a href="#Child-sibling_Tree">Child-sibling Tree</a></li>
- <li><a href="#Context-free_grammar">Context-free grammar</a></li>
- <li><a href="#Context-sensitive">Context-sensitive</a></li>
- <li><a href="#DFA">DFA</a></li>
- <li><a href="#FIRST">FIRST</a></li>
- <li><a href="#FOLLOW">FOLLOW</a></li>
- <li><a href="#Grammar">Grammar</a></li>
- <li><a href="#Hoisting">Hoisting</a></li>
- <li><a href="#Inheritance,_grammar">Inheritance, grammar</a></li>
- <li><a href="#LA(n)">LA(n)</a></li>
- <li><a href="#Left-prefix,_left_factor">Left-prefix, left factor</a></li>
- <li><a href="#Literal">Literal</a></li>
- <li><a href="#Linear_approximate_lookahead">Linear approximate lookahead</a></li>
- <li><a href="#LL(k)">LL(k)</a></li>
- <li><a href="#LT(n)">LT(n)</a></li>
- <li><a href="#Language">Language</a></li>
- <li><a href="#Lexer">Lexer</a></li>
- <li><a href="#Lookahead">Lookahead</a></li>
- <li><a href="#nextToken">nextToken</a></li>
- <li><a href="#NFA">NFA</a></li>
- <li><a href="#Nondeterministic">Nondeterministic</a></li>
- <li><a href="#Parser">Parser</a></li>
- <li><a href="#Predicate,_semantic">Predicate, semantic</a></li>
- <li><a href="#Predicate,_syntactic">Predicate, syntactic</a></li>
- <li><a href="#Production">Production</a></li>
- <li><a href="#Protected">Protected</a></li>
- <li><a href="#Recursive-descent">Recursive-descent</a></li>
- <li><a href="#Regular">Regular</a></li>
- <li><a href="#Rule">Rule</a></li>
- <li><a href="#Scanner">Scanner</a></li>
- <li><a href="#Semantics">Semantics</a></li>
- <li><a href="#Subrule">Subrule</a></li>
- <li><a href="#Syntax">Syntax</a></li>
- <li><a href="#Token">Token</a></li>
- <li><a href="#Token_stream">Token stream</a></li>
- <li><a href="#Tree">Tree</a></li>
- <li><a href="#Tree_parser">Tree parser</a></li>
- <li><a href="#Vocabulary">Vocabulary</a></li>
- <li><a href="#Wow">Wow</a></li>
- </ul>
- <!--/index-->
- </font>
- <h3><a id="Ambiguous" name="Ambiguous">Ambiguous</a></h3>
- A language is ambiguous if the same sentence or phrase can be
- interpreted in more than a single way. For example, the following
- sentence by Groucho Marx is easily interpreted in two ways: "I once
- shot an elephant in my pajamas. How he got in my pajamas I'll never
- know!" In the computer world, a typical language ambiguity is the
- if-then-else ambiguity where the else-clause may be attached to either
- the most recent if-then or an older one. Reference manuals for
- computer languages resolve this ambiguity by stating that else-clauses
- always match up with the most recent if-then.
- <p>
- A grammar is ambiguous if the same input sequence can be derived in
- multiple ways. Ambiguous languages always yield ambiguous grammars
- unless you can find a way to encode semantics (actions or predicates
- etc...) that resolve the ambiguity. Most language tools like ANTLR
- resolve the if-then-else ambiguity by simply choosing to match
- greedily (i.e., as soon as possible). This matches the else with the
- most recent if-then. See nondeterministic.
- <h3><a id="ANTLR" name="ANTLR">ANTLR</a></h3>
- <b>AN</b>other <b>T</b>ool for <b>L</b>anguage <b>R</b>ecognition, a
- predicated-LL(k) parser generator that handles lexers, parsers, and
- tree parsers. ANTLR has been available since 1990 and led to a
- resurgence of recursive-descent parsing after decades dominated by LR
- and other DFA-based strategies.
- <h3><a id="AST" name="AST">AST</a></h3>
- <b>A</b>bstract <b>S</b>yntax <b>T</b>ree. ASTs are used as internal
- representations of an input stream, normally constructed during a
- parsing phase. Because AST are two-dimensional trees they
- can encode the structure (as determined by the parser) of the input as
- well as the input symbols.
- <p>
- A homogeneous AST is in one in which the physical objects are all of
- the same type; e.g., <tt>CommonAST</tt> in ANTLR. A heterogeneous
- tree may have multiple types such as <tt>PlusNode</tt>,
- <tt>MultNode</tt> etc...
- <p> An AST is not a parse tree, which encodes the sequence of rules
- used to match input symbols. See <a
- href="http://www.jguru.com/faq/view.jsp?EID=814505"><b>What's the
- difference between a parse tree and an abstract syntax tree (AST)? Why
- doesn't ANTLR generate trees with nodes for grammar rules like JJTree
- does?</b></a>.
- <p>An AST for input <tt>3+4</tt> might be represented as
- <pre>
- +
- / \
- 3 4
- </pre>
- or more typically (ala ANTLR) in child-sibling form:
- <pre>
- +
- |
- 3--4
- </pre>
- Operators are usually subtree roots and operands are usually leaves.
- <h3><a id="Bit_set" name="Bit_set">Bit set</a></h3>
- Bit sets are an extremely efficient representation for dense integer
- sets. You can easily encode sets of strings also by mapping unique
- strings to unique integers. ANTLR uses bitsets for lookahead
- prediction in parsers and lexers. Simple bit set implementations do
- not work so well for sparse sets, particularly when the maximum
- integer stored in the set is large.
- <p> ANTLR's bit set represents membership with a bit for each possible
- integer value. For a maximum value of <i>n</i>, a bit set needs
- <i>n/64</i> long words or <i>n/8</i> bytes. For ASCII bit sets with a
- maximum value of 127, you only need 16 bytes or 2 long words. UNICODE
- has a max value of \uFFFF or 65535, requiring 8k bytes, and these sets
- are typically sparse. Fortunately most lexers only need a few of
- these space inefficient (but speedy) bitsets and so it's not really a
- problem.
- <h3><a id="Child-sibling_Tree" name="Child-sibling_Tree">Child-sibling Tree</a></h3>
- A particularly efficient data structure for representing trees. See AST.
- <h3><a id="Context-free_grammar" name="Context-free_grammar">Context-free grammar</a></h3>
- A grammar where recognition of a particular construct does not depend
- on whether it is in a particular syntactic context. A context-free
- grammar has a set of rules like
- <pre>
- stat : IF expr THEN stat
- | ...
- ;
- </pre>
- where there is no restriction on when the <tt>IF</tt> alternative may
- be applied--if you are in rule <tt>stat</tt>, you may apply the
- alternative.
- <h3><a id="Context-sensitive" name="Context-sensitive">Context-sensitive</a></h3>
- A grammar where recognition of a particular construct may depend on a
- syntactic context. You never see these grammars in practice because
- they are impractical (Note, an Earley parser is O(n^3) worst-case for
- context-<i>free</i> grammars). A context-free rule looks like:
- <pre>
- Α → γ
- </pre>
- but a context-<i>sensitive</i> rule may have context on the left-side:
- <pre>
- αΑβ → αγβ
- </pre>
- meaning that rule Α may only be applied (converted to γ)
- in between α and β.
- <p>
- In an ANTLR sense, you can recognize context-sensitive constructs with
- a semantic predicate. The action evaluates to true or false
- indicating the validity of applying the alternative.
- <p>
- See <a
- href="http://www.wikipedia.org/wiki/Context-sensitive"><b>Context-sensitive gramar</b></a>.
- <h3><a id="DFA" name="DFA">DFA</a></h3>
- <b>D</b>eterministic <b>F</b>inite <b>A</b>utomata. A state machine
- used typically to formally describe lexical analyzers. <tt>lex</tt>
- builds a DFA to recognize tokens whereas ANTLR builds a recursive
- descent lexer similar to what you would build by hand. See <a
- href="http://www.wikipedia.org/wiki/Finite_state_machine"><b>Finite
- state machine</b></a> and ANTLR's lexer documentation.
- <h3><a id="FIRST" name="FIRST">FIRST</a></h3>
- The set of symbols that may be matched on the left-edge of a rule.
- For example, the FIRST(decl) is set {ID, INT} for the following:
- <pre>
- decl : ID ID SEMICOLON
- | INT ID SEMICOLON
- ;
- </pre>
- The situation gets more complicated when you have optional
- constructs. The FIRST(a) below is {A,B,C}
- <pre>
- a : (A)? B
- | C
- ;
- </pre>
- because the A is optional and the B may be seen on the left-edge.
- <p>
- Naturally k>1 lookahead symbols makes this even more complicated.
- FIRST_k must track sets of k-sequences not just individual symbols.
- <h3><a id="FOLLOW" name="FOLLOW">FOLLOW</a></h3>
- The set of input symbols that may follow any reference to the
- specified rule. For example, FOLLOW(decl) is {RPAREN, SEMICOLON):
- <pre>
- methodHead : ID LPAREN decl RPAREN ;
- var : decl SEMICOLON ;
- decl : TYPENAME ID ;
- </pre>
- because RPAREN and SEMICOLON both follow references to rule decl.
- FIRST and FOLLOW computations are used to analyze grammars and
- generate parsing decisions.
- <p>
- This grammar analysis all gets very complicated when k>1.
- <h3><a id="Grammar" name="Grammar">Grammar</a></h3>
- A finite means of formally describing the structure of a possibly
- infinite language. Parser generators build parsers that recognize
- sentences in the language described by a grammar. Most parser
- generators allow you to add actions to be executed during the parse.
- <h3><a id="Hoisting" name="Hoisting">Hoisting</a></h3>
- Semantic predicates describe the semantic context in which a rule or
- alternative applies. The predicate is hoisted into a prediction
- expression. Hoisting typically refers to pulling a predicate out of
- its enclosing rule and into the prediction expression of another
- rule. For example,
- <pre>
- decl : typename ID SEMICOLON
- | ID ID SEMICOLON
- ;
- typename : {isType(LT(1))}? ID
- ;
- </pre>
- The predicate is not needed in typename as there is no decision,
- however, rule decl needs it to distinguish between its two
- alternatives. The first alternative would look like:
- <pre>
- if ( LA(1)==ID && isType(LT(1)) ) {
- typename();
- match(ID);
- match(SEMICOLON);
- }
- </pre>
- PCCTS 1.33 did, but ANTLR currently does not hoist predicates into
- other rules.
- <h3><a id="Inheritance,_grammar" name="Inheritance,_grammar">Inheritance, grammar</a></h3>
- The ability of ANTLR to define a new grammar as it differs from an
- existing grammar. See the ANTLR documentation.
- <h3><a id="LA(n)" name="LA(n)">LA(n)</a></h3>
- The nth lookahead character, token type, or AST node type depending
- on the grammar type (lexer, parser, or tree parser respectively).
- <h3><a id="Left-prefix,_left_factor" name="Left-prefix,_left_factor">Left-prefix, left factor</a></h3>
- A common sequence of symbols on the left-edge of a set of alternatives
- such as:
- <pre>
- a : A B X
- | A B Y
- ;
- </pre>
- The left-prefix is A B, which you can remove by left-factoring:
- <pre>
- a : A B (X|Y)
- ;
- </pre>
- Left-factoring is done to reduce lookahead requirements.
- <h3><a id="Literal" name="Literal">Literal</a></h3>
- Generally a literal refers to a fixed string such as <tt>begin</tt>
- that you wish to match. When you reference a literal in an ANTLR
- grammar via <tt>"begin"</tt>, ANTLR assigns it a token type like any
- other token. If you have defined a lexer, ANTLR provides information
- about the literal (type and text) to the lexer so it may detect
- occurrences of the literal.
- <h3><a id="Linear_approximate_lookahead" name="Linear_approximate_lookahead">Linear approximate lookahead</a></h3>
- An approximation to full lookahead (that can be applied to both LL and LR
- parsers) for k>1 that reduces the complexity of storing and testing
- lookahead from O(n^k) to O(nk); exponential to linear reduction. When
- linear approximate lookahead is insufficient (results in a
- nondeterministic parser), you can use the approximate lookahead to
- attenuate the cost of building the full decision.
- <p>Here is a simple example illustrating the difference between full
- and approximate lookahead:
- <pre>
- a : (A B | C D)
- | A D
- ;
- </pre>
- This rule is LL(2) but not linear approximate LL(2). The real
- FIRST_2(a) is {AB,CD} for alternative 1 and {AD} for alternative 2.
- No intersection, so no problem. Linear approximate lookahead
- collapses all symbols at depth i yielding k sets instead of a possibly
- n^k k-sequences. The approximation (compressed) sets are
- {AB,AD,CD,CB} and {AD}. Note the introduction of the spurious
- k-sequences AD and CB. Unfortunately, this compression introduces a
- conflict upon AD between the alternatives. PCCTS did full LL(k) and
- ANTLR does linear approximate only as I found that linear approximate
- lookahead works for the vast majority of parsing decisions and is
- extremely fast. I find one or two problem spots in a large grammar
- usually with ANTLR, which forces me to reorganize my grammar in a
- slightly unnatural manner. Unfortunately, your brain does full LL(k)
- and ANTLR does a slightly weaker linear approximate lookahead--a
- source of many (invalid) bug reports ;)
- <p>
- This compression was the subject of <a href="http://www.antlr.org/papers/parr.phd.thesis.pdf"><b>my doctoral
- dissertation</b></a> (PDF 477k) at Purdue.
- <h3><a id="LL(k)" name="LL(k)">LL(k)</a></h3>
- Formally, LL(k) represents a class of parsers and grammars that parse
- symbols from left-to-right (beginning to end of input stream) using a
- leftmost derivation and using k symbols of lookahead. A leftmost
- derivation is one in which derivations (parses) proceed by attempting
- to replace rule references from left-to-right within a production.
- Given the following rule
- <pre>
- stat : IF expr THEN stat
- | ...
- ;
- </pre>
- an LL parser would match the IF then attempt to parse expr rather than
- a rightmost derivation, which would attempt to parse stat first.
- <p>
- LL(k) is synonymous with a "top-down" parser because the
- parser begins at the start symbol and works its way down the
- derivation/parse tree (tree here means the stack of method activations
- for recursive descent or symbol stack for a table-driven parser). A
- recursive-descent parser is particular implementation of an LL parser
- that uses functions or method calls to implement the parser rather
- than a table.
- <p>
- ANTLR generates predicate-LL(k) parsers that support syntactic and
- sematic predicates allowing you to specify many context-free and
- context-sensitive grammars (with a bit of work).
- <h3><a id="LT(n)" name="LT(n)">LT(n)</a></h3>
- In a parser, this is the nth lookahead Token object.
- <h3><a id="Language" name="Language">Language</a></h3> A possibly infinite set of valid sentences. The
- vocabulary symbols may be characters, tokens, and tree nodes in an
- ANTLR context.
- <h3><a id="Lexer" name="Lexer">Lexer</a></h3>
- A recognizer that breaks up a stream of characters into
- vocabulary symbols for a parser. The parser pulls vocabulary symbols
- from the lexer via a queue.
- <h3><a id="Lookahead" name="Lookahead">Lookahead</a></h3>
- When parsing a stream of input symbols, a parser has matched (and no
- longer needs to consider) a portion of the stream to the left of its
- read pointer. The next k symbols to the right of the read pointer are
- considered the fixed lookahead. This information is used to direct
- the parser to the next state. In an LL(k) parser this means to
- predict which path to take from the current state using the next k
- symbols of lookahead.
- <p> ANTLR supports syntactic predicates, a manually-specified form of
- backtracking that effectively gives you infinite lookahead. For
- example, consider the following rule that distinguishes between sets
- (comma-separated lists of words) and parallel assignments (one list
- assigned to another):
- <pre>
- stat: ( list "=" )=> list "=" list
- | list
- ;
- </pre>
- If a list followed by an assignment operator is found on the input
- stream, the first production is predicted. If not, the second
- alternative production is attempted.
- <h3><a id="nextToken" name="nextToken">nextToken</a></h3>
- A lexer method automatically generated by ANTLR that figures out which
- of the lexer rules to apply. For example, if you have two rules ID
- and INT in your lexer, ANTLR will generate a lexer with methods for ID
- and INT as well as a nextToken method that figures out which rule
- method to attempt given k input characters.
- <h3><a id="NFA" name="NFA">NFA</a></h3>
- <b>N</b>ondeterministic <b>F</b>inite <b>A</b>utomata. See <a
- href="http://www.wikipedia.org/wiki/Finite_state_machine"><b>Finite
- state machine</b></a>.
- <h3><a id="Nondeterministic" name="Nondeterministic">Nondeterministic</a></h3>
- A parser is nondeterministic if there is at least one decision point
- where the parser cannot resolve which path to take. Nondeterminisms
- arise because of parsing strategy weaknesses.
- <ul>
- <li>If your strategy works only for unambiguous grammars, then
- ambiguous grammars will yield nondeterministic parsers; this is true
- of the basic LL, LR strategies. Even unambiguous grammars can yield
- nondeterministic parsers though. Here is a nondeterministic LL(1)
- grammar:
- <pre>
- decl : ID ID SEMICOLON
- | ID SEMICOLON
- ;
- </pre>
- Rule <tt>decl</tt> is, however, LL(2) because the second lookahead
- symbol (either ID or SEMICOLON) uniquely determines which alternative
- to predict. You could also left-factor the rule to reduce the
- lookahead requirements.<br><br>
- <li>
- If you are willing to pay a
- performance hit or simply need to handle ambiguous grammars, you can
- use an Earley parser or a Tomita parser (LR-based) that match all
- possible interpretations of the input, thus, avoiding the idea of
- nondeterminism altogether. This does present problems when trying to
- execute actions, however, because multiple parses are, in effect,
- occurring in parallel.
- </ul>
- <p>
- Note that a parser may have multiple decision points that are
- nondeterministic.
- <h3><a id="Parser" name="Parser">Parser</a></h3>
- A recognizer that applies a grammatical structure to a stream
- of vocabulary symbols called tokens.
- <h3><a id="Predicate,_semantic" name="Predicate,_semantic">Predicate, semantic</a></h3>
- A semantic predicate is a boolean expression used to alter the parse
- based upon semantic information. This information is usually a
- function of the constructs/input that have already been matched, but
- can even be a flag that turns on and off subsets of the language (as
- you might do for a grammar handling both K&R and ANSI C). One of the
- most common semantic predicates uses a symbol table to help
- distinguish between syntactically, but semantically different
- productions. In FORTRAN, array references and function calls look the
- same, but may be distinguished by checking what the type of the
- identifier is.
- <pre>
- expr : {isVar(LT(1))}? ID LPAREN args RPAREN // array ref
- | {isFunction(LT(1))}? ID LPAREN args RPAREN // func call
- ;
- </pre>
- <h3><a id="Predicate,_syntactic" name="Predicate,_syntactic">Predicate, syntactic</a></h3>
- A selective form of backtracking used to recognize language constructs
- that cannot be distinguished without seeing all or most of the
- construct. For example, in C++ some declarations look exactly like
- expressions. You have to check to see if it is a declaration. If it
- parses like a declaration, assume it is a declaration--reparse it with
- "feeling" (execute your actions). If not, it must be an expression or
- an error:
- <pre>
- stat : (declaration) => declaration
- | expression
- ;
- </pre>
- <h3><a id="Production" name="Production">Production</a></h3>
- An alternative in a grammar rule.
- <h3><a id="Protected" name="Protected">Protected</a></h3>
- A protected lexer rule does not represent a complete token--it is a
- helper rule referenced by another lexer rule. This overloading of the
- access-visibility Java term occurs because if the rule is not visible,
- it cannot be "seen" by the parser (yes, this nomeclature sucks).
- <h3><a id="Recursive-descent" name="Recursive-descent">Recursive-descent</a></h3>
- See LL(k).
- <h3><a id="Regular" name="Regular">Regular</a></h3>
- A regular language is one that can be described by a regular grammar
- or regular expression or accepted by a DFA-based lexer such as those
- generated by <tt>lex</tt>. Regular languages are normally used to
- describe tokens.
- <p> In practice you can pick out a regular grammar by noticing that
- references to other rules are not allowed accept at the end of a
- production. The following grammar is regular because reference to
- <tt>B</tt> occurs at the right-edge of rule <tt>A</tt>.
- <pre>
- A : ('a')+ B ;
- B : 'b' ;
- </pre>
- Another way to look at it is, "what can I recognize without a stack
- (such as a method return address stack)?".
- <p>
- Regular grammars cannot describe context-free languages, hence, LL or
- LR based grammars are used to describe programming languages. ANTLR
- is not restricted to regular languages for tokens because it generates
- recursive-descent lexers. This makes it handy to recognize HTML tags
- and so on all in the lexer.
- <h3><a id="Rule" name="Rule">Rule</a></h3>
- A rule describes a partial sentence in a language such as a statement
- or expression in a programming language. Rules may have one or more
- alternative productions.
- <h3><a id="Scanner" name="Scanner">Scanner</a></h3>
- See Lexer.
- <h3><a id="Semantics" name="Semantics">Semantics</a></h3> See <a href="http://www.jguru.com/faq/view.jsp?EID=81"><b>What
- do "syntax" and "semantics" mean and how are they different?</b></a>.
- <h3><a id="Subrule" name="Subrule">Subrule</a></h3>
- Essentially a rule that has been expanded inline. Subrules are
- enclosed in parenthesis and may have suffixes like star, plus, and
- question mark that indicate zero-or-more, one-or-more, or optional.
- The following rule has 3 subrules:
- <pre>
- a : (A|B)+ (C)* (D)?
- ;
- </pre>
- <h3><a id="Syntax" name="Syntax">Syntax</a></h3>
- See <a href="http://www.jguru.com/faq/view.jsp?EID=81"><b>What
- do "syntax" and "semantics" mean and how are they different?</b></a>.
- <h3><a id="Token" name="Token">Token</a></h3>
- A vocabulary symbol for a language. This term typically refers to the
- vocabulary symbols of a parser. A token may represent a constant
- symbol such as a keyword like <tt>begin</tt> or a "class" of input
- symbols like <tt>ID</tt> or <tt>INTEGER_LITERAL</tt>.
- <h3><a id="Token_stream" name="Token_stream">Token stream</a></h3>
- See <a href="http://www.antlr.org/doc/streams.html"><b>Token
- Streams</b></a> in the ANTLR documentation.
- <h3><a id="Tree" name="Tree">Tree</a></h3>
- See AST and <a
- href="http://www.jguru.com/faq/view.jsp?EID=814505"><b>What's the
- difference between a parse tree and an abstract syntax tree (AST)? Why
- doesn't ANTLR generate trees with nodes for grammar rules like JJTree
- does?</b></a>.
-
- <h3><a id="Tree_parser" name="Tree_parser">Tree parser</a></h3> A recognizer that applies a grammatical structure to a
- two-dimensional input tree. Grammatical rules are like an "executable
- comment" that describe the tree structure. These parsers are useful
- during translation to (i) annotate trees with, for example, symbol
- table information, (2) perform tree rewrites, and (3) generate output.
- <h3><a id="Vocabulary" name="Vocabulary">Vocabulary</a></h3>
- The set of symbols used to construct sentences in a language. These
- symbols are usually called tokens or token types. For lexers, the
- vocabulary is a set of characters.
- <h3><a id="Wow" name="Wow">Wow</a></h3> See ANTLR.
- </font>
- </body>
- </html>