PageRenderTime 36ms CodeModel.GetById 22ms app.highlight 7ms RepoModel.GetById 1ms app.codeStats 0ms

/3rd_party/llvm/docs/tutorial/LangImpl2.html

https://code.google.com/p/softart/
HTML | 1231 lines | 1004 code | 208 blank | 19 comment | 0 complexity | 22a89bd7a53820a7bb9544055abd23b2 MD5 | raw file
Possible License(s): LGPL-2.1, BSD-3-Clause, JSON, MPL-2.0-no-copyleft-exception, GPL-2.0, GPL-3.0, LGPL-3.0, BSD-2-Clause
   1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                      "http://www.w3.org/TR/html4/strict.dtd">
   3
   4<html>
   5<head>
   6  <title>Kaleidoscope: Implementing a Parser and AST</title>
   7  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   8  <meta name="author" content="Chris Lattner">
   9  <link rel="stylesheet" href="../_static/llvm.css" type="text/css">
  10</head>
  11
  12<body>
  13
  14<h1>Kaleidoscope: Implementing a Parser and AST</h1>
  15
  16<ul>
  17<li><a href="index.html">Up to Tutorial Index</a></li>
  18<li>Chapter 2
  19  <ol>
  20    <li><a href="#intro">Chapter 2 Introduction</a></li>
  21    <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
  22    <li><a href="#parserbasics">Parser Basics</a></li>
  23    <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
  24    <li><a href="#parserbinops">Binary Expression Parsing</a></li>
  25    <li><a href="#parsertop">Parsing the Rest</a></li>
  26    <li><a href="#driver">The Driver</a></li>
  27    <li><a href="#conclusions">Conclusions</a></li>
  28    <li><a href="#code">Full Code Listing</a></li>
  29  </ol>
  30</li>
  31<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
  32</ul>
  33
  34<div class="doc_author">
  35  <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
  36</div>
  37
  38<!-- *********************************************************************** -->
  39<h2><a name="intro">Chapter 2 Introduction</a></h2>
  40<!-- *********************************************************************** -->
  41
  42<div>
  43
  44<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
  45with LLVM</a>" tutorial.  This chapter shows you how to use the lexer, built in 
  46<a href="LangImpl1.html">Chapter 1</a>, to build a full <a
  47href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
  48our Kaleidoscope language.  Once we have a parser, we'll define and build an <a 
  49href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax 
  50Tree</a> (AST).</p>
  51
  52<p>The parser we will build uses a combination of <a 
  53href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
  54Parsing</a> and <a href=
  55"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence 
  56Parsing</a> to parse the Kaleidoscope language (the latter for 
  57binary expressions and the former for everything else).  Before we get to
  58parsing though, lets talk about the output of the parser: the Abstract Syntax
  59Tree.</p>
  60
  61</div>
  62
  63<!-- *********************************************************************** -->
  64<h2><a name="ast">The Abstract Syntax Tree (AST)</a></h2>
  65<!-- *********************************************************************** -->
  66
  67<div>
  68
  69<p>The AST for a program captures its behavior in such a way that it is easy for
  70later stages of the compiler (e.g. code generation) to interpret.  We basically
  71want one object for each construct in the language, and the AST should closely
  72model the language.  In Kaleidoscope, we have expressions, a prototype, and a
  73function object.  We'll start with expressions first:</p>
  74
  75<div class="doc_code">
  76<pre>
  77/// ExprAST - Base class for all expression nodes.
  78class ExprAST {
  79public:
  80  virtual ~ExprAST() {}
  81};
  82
  83/// NumberExprAST - Expression class for numeric literals like "1.0".
  84class NumberExprAST : public ExprAST {
  85  double Val;
  86public:
  87  NumberExprAST(double val) : Val(val) {}
  88};
  89</pre>
  90</div>
  91
  92<p>The code above shows the definition of the base ExprAST class and one
  93subclass which we use for numeric literals.  The important thing to note about
  94this code is that the NumberExprAST class captures the numeric value of the
  95literal as an instance variable. This allows later phases of the compiler to
  96know what the stored numeric value is.</p>
  97
  98<p>Right now we only create the AST,  so there are no useful accessor methods on
  99them.  It would be very easy to add a virtual method to pretty print the code,
 100for example.  Here are the other expression AST node definitions that we'll use
 101in the basic form of the Kaleidoscope language:
 102</p>
 103
 104<div class="doc_code">
 105<pre>
 106/// VariableExprAST - Expression class for referencing a variable, like "a".
 107class VariableExprAST : public ExprAST {
 108  std::string Name;
 109public:
 110  VariableExprAST(const std::string &amp;name) : Name(name) {}
 111};
 112
 113/// BinaryExprAST - Expression class for a binary operator.
 114class BinaryExprAST : public ExprAST {
 115  char Op;
 116  ExprAST *LHS, *RHS;
 117public:
 118  BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) 
 119    : Op(op), LHS(lhs), RHS(rhs) {}
 120};
 121
 122/// CallExprAST - Expression class for function calls.
 123class CallExprAST : public ExprAST {
 124  std::string Callee;
 125  std::vector&lt;ExprAST*&gt; Args;
 126public:
 127  CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 128    : Callee(callee), Args(args) {}
 129};
 130</pre>
 131</div>
 132
 133<p>This is all (intentionally) rather straight-forward: variables capture the
 134variable name, binary operators capture their opcode (e.g. '+'), and calls
 135capture a function name as well as a list of any argument expressions.  One thing 
 136that is nice about our AST is that it captures the language features without 
 137talking about the syntax of the language.  Note that there is no discussion about 
 138precedence of binary operators, lexical structure, etc.</p>
 139
 140<p>For our basic language, these are all of the expression nodes we'll define.
 141Because it doesn't have conditional control flow, it isn't Turing-complete;
 142we'll fix that in a later installment.  The two things we need next are a way
 143to talk about the interface to a function, and a way to talk about functions
 144themselves:</p>
 145
 146<div class="doc_code">
 147<pre>
 148/// PrototypeAST - This class represents the "prototype" for a function,
 149/// which captures its name, and its argument names (thus implicitly the number
 150/// of arguments the function takes).
 151class PrototypeAST {
 152  std::string Name;
 153  std::vector&lt;std::string&gt; Args;
 154public:
 155  PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 156    : Name(name), Args(args) {}
 157};
 158
 159/// FunctionAST - This class represents a function definition itself.
 160class FunctionAST {
 161  PrototypeAST *Proto;
 162  ExprAST *Body;
 163public:
 164  FunctionAST(PrototypeAST *proto, ExprAST *body)
 165    : Proto(proto), Body(body) {}
 166};
 167</pre>
 168</div>
 169
 170<p>In Kaleidoscope, functions are typed with just a count of their arguments.
 171Since all values are double precision floating point, the type of each argument
 172doesn't need to be stored anywhere.  In a more aggressive and realistic
 173language, the "ExprAST" class would probably have a type field.</p>
 174
 175<p>With this scaffolding, we can now talk about parsing expressions and function
 176bodies in Kaleidoscope.</p>
 177
 178</div>
 179
 180<!-- *********************************************************************** -->
 181<h2><a name="parserbasics">Parser Basics</a></h2>
 182<!-- *********************************************************************** -->
 183
 184<div>
 185
 186<p>Now that we have an AST to build, we need to define the parser code to build
 187it.  The idea here is that we want to parse something like "x+y" (which is
 188returned as three tokens by the lexer) into an AST that could be generated with
 189calls like this:</p>
 190
 191<div class="doc_code">
 192<pre>
 193  ExprAST *X = new VariableExprAST("x");
 194  ExprAST *Y = new VariableExprAST("y");
 195  ExprAST *Result = new BinaryExprAST('+', X, Y);
 196</pre>
 197</div>
 198
 199<p>In order to do this, we'll start by defining some basic helper routines:</p>
 200
 201<div class="doc_code">
 202<pre>
 203/// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 204/// token the parser is looking at.  getNextToken reads another token from the
 205/// lexer and updates CurTok with its results.
 206static int CurTok;
 207static int getNextToken() {
 208  return CurTok = gettok();
 209}
 210</pre>
 211</div>
 212
 213<p>
 214This implements a simple token buffer around the lexer.  This allows 
 215us to look one token ahead at what the lexer is returning.  Every function in
 216our parser will assume that CurTok is the current token that needs to be
 217parsed.</p>
 218
 219<div class="doc_code">
 220<pre>
 221
 222/// Error* - These are little helper functions for error handling.
 223ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 224PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 225FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 226</pre>
 227</div>
 228
 229<p>
 230The <tt>Error</tt> routines are simple helper routines that our parser will use
 231to handle errors.  The error recovery in our parser will not be the best and
 232is not particular user-friendly, but it will be enough for our tutorial.  These
 233routines make it easier to handle errors in routines that have various return
 234types: they always return null.</p>
 235
 236<p>With these basic helper functions, we can implement the first
 237piece of our grammar: numeric literals.</p>
 238
 239</div>
 240
 241<!-- *********************************************************************** -->
 242<h2><a name="parserprimexprs">Basic Expression Parsing</a></h2>
 243<!-- *********************************************************************** -->
 244
 245<div>
 246
 247<p>We start with numeric literals, because they are the simplest to process.
 248For each production in our grammar, we'll define a function which parses that
 249production.  For numeric literals, we have:
 250</p>
 251
 252<div class="doc_code">
 253<pre>
 254/// numberexpr ::= number
 255static ExprAST *ParseNumberExpr() {
 256  ExprAST *Result = new NumberExprAST(NumVal);
 257  getNextToken(); // consume the number
 258  return Result;
 259}
 260</pre>
 261</div>
 262
 263<p>This routine is very simple: it expects to be called when the current token
 264is a <tt>tok_number</tt> token.  It takes the current number value, creates 
 265a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally
 266returns.</p>
 267
 268<p>There are some interesting aspects to this.  The most important one is that
 269this routine eats all of the tokens that correspond to the production and
 270returns the lexer buffer with the next token (which is not part of the grammar
 271production) ready to go.  This is a fairly standard way to go for recursive
 272descent parsers.  For a better example, the parenthesis operator is defined like
 273this:</p>
 274
 275<div class="doc_code">
 276<pre>
 277/// parenexpr ::= '(' expression ')'
 278static ExprAST *ParseParenExpr() {
 279  getNextToken();  // eat (.
 280  ExprAST *V = ParseExpression();
 281  if (!V) return 0;
 282  
 283  if (CurTok != ')')
 284    return Error("expected ')'");
 285  getNextToken();  // eat ).
 286  return V;
 287}
 288</pre>
 289</div>
 290
 291<p>This function illustrates a number of interesting things about the 
 292parser:</p>
 293
 294<p>
 2951) It shows how we use the Error routines.  When called, this function expects
 296that the current token is a '(' token, but after parsing the subexpression, it
 297is possible that there is no ')' waiting.  For example, if the user types in
 298"(4 x" instead of "(4)", the parser should emit an error.  Because errors can
 299occur, the parser needs a way to indicate that they happened: in our parser, we
 300return null on an error.</p>
 301
 302<p>2) Another interesting aspect of this function is that it uses recursion by
 303calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
 304<tt>ParseParenExpr</tt>).  This is powerful because it allows us to handle 
 305recursive grammars, and keeps each production very simple.  Note that
 306parentheses do not cause construction of AST nodes themselves.  While we could
 307do it this way, the most important role of parentheses are to guide the parser
 308and provide grouping.  Once the parser constructs the AST, parentheses are not
 309needed.</p>
 310
 311<p>The next simple production is for handling variable references and function
 312calls:</p>
 313
 314<div class="doc_code">
 315<pre>
 316/// identifierexpr
 317///   ::= identifier
 318///   ::= identifier '(' expression* ')'
 319static ExprAST *ParseIdentifierExpr() {
 320  std::string IdName = IdentifierStr;
 321  
 322  getNextToken();  // eat identifier.
 323  
 324  if (CurTok != '(') // Simple variable ref.
 325    return new VariableExprAST(IdName);
 326  
 327  // Call.
 328  getNextToken();  // eat (
 329  std::vector&lt;ExprAST*&gt; Args;
 330  if (CurTok != ')') {
 331    while (1) {
 332      ExprAST *Arg = ParseExpression();
 333      if (!Arg) return 0;
 334      Args.push_back(Arg);
 335
 336      if (CurTok == ')') break;
 337
 338      if (CurTok != ',')
 339        return Error("Expected ')' or ',' in argument list");
 340      getNextToken();
 341    }
 342  }
 343
 344  // Eat the ')'.
 345  getNextToken();
 346  
 347  return new CallExprAST(IdName, Args);
 348}
 349</pre>
 350</div>
 351
 352<p>This routine follows the same style as the other routines.  (It expects to be
 353called if the current token is a <tt>tok_identifier</tt> token).  It also has
 354recursion and error handling.  One interesting aspect of this is that it uses
 355<em>look-ahead</em> to determine if the current identifier is a stand alone
 356variable reference or if it is a function call expression.  It handles this by
 357checking to see if the token after the identifier is a '(' token, constructing
 358either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
 359</p>
 360
 361<p>Now that we have all of our simple expression-parsing logic in place, we can
 362define a helper function to wrap it together into one entry point.  We call this
 363class of expressions "primary" expressions, for reasons that will become more
 364clear <a href="LangImpl6.html#unary">later in the tutorial</a>.  In order to
 365parse an arbitrary primary expression, we need to determine what sort of
 366expression it is:</p>
 367
 368<div class="doc_code">
 369<pre>
 370/// primary
 371///   ::= identifierexpr
 372///   ::= numberexpr
 373///   ::= parenexpr
 374static ExprAST *ParsePrimary() {
 375  switch (CurTok) {
 376  default: return Error("unknown token when expecting an expression");
 377  case tok_identifier: return ParseIdentifierExpr();
 378  case tok_number:     return ParseNumberExpr();
 379  case '(':            return ParseParenExpr();
 380  }
 381}
 382</pre>
 383</div>
 384
 385<p>Now that you see the definition of this function, it is more obvious why we
 386can assume the state of CurTok in the various functions.  This uses look-ahead
 387to determine which sort of expression is being inspected, and then parses it
 388with a function call.</p>
 389
 390<p>Now that basic expressions are handled, we need to handle binary expressions.
 391They are a bit more complex.</p>
 392
 393</div>
 394
 395<!-- *********************************************************************** -->
 396<h2><a name="parserbinops">Binary Expression Parsing</a></h2>
 397<!-- *********************************************************************** -->
 398
 399<div>
 400
 401<p>Binary expressions are significantly harder to parse because they are often
 402ambiguous.  For example, when given the string "x+y*z", the parser can choose
 403to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from
 404mathematics, we expect the later parse, because "*" (multiplication) has
 405higher <em>precedence</em> than "+" (addition).</p>
 406
 407<p>There are many ways to handle this, but an elegant and efficient way is to
 408use <a href=
 409"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence 
 410Parsing</a>.  This parsing technique uses the precedence of binary operators to
 411guide recursion.  To start with, we need a table of precedences:</p>
 412
 413<div class="doc_code">
 414<pre>
 415/// BinopPrecedence - This holds the precedence for each binary operator that is
 416/// defined.
 417static std::map&lt;char, int&gt; BinopPrecedence;
 418
 419/// GetTokPrecedence - Get the precedence of the pending binary operator token.
 420static int GetTokPrecedence() {
 421  if (!isascii(CurTok))
 422    return -1;
 423    
 424  // Make sure it's a declared binop.
 425  int TokPrec = BinopPrecedence[CurTok];
 426  if (TokPrec &lt;= 0) return -1;
 427  return TokPrec;
 428}
 429
 430int main() {
 431  // Install standard binary operators.
 432  // 1 is lowest precedence.
 433  BinopPrecedence['&lt;'] = 10;
 434  BinopPrecedence['+'] = 20;
 435  BinopPrecedence['-'] = 20;
 436  BinopPrecedence['*'] = 40;  // highest.
 437  ...
 438}
 439</pre>
 440</div>
 441
 442<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
 443(this can obviously be extended by you, our brave and intrepid reader).  The
 444<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
 445or -1 if the token is not a binary operator.  Having a map makes it easy to add
 446new operators and makes it clear that the algorithm doesn't depend on the
 447specific operators involved, but it would be easy enough to eliminate the map
 448and do the comparisons in the <tt>GetTokPrecedence</tt> function.  (Or just use
 449a fixed-size array).</p>
 450
 451<p>With the helper above defined, we can now start parsing binary expressions.
 452The basic idea of operator precedence parsing is to break down an expression
 453with potentially ambiguous binary operators into pieces.  Consider ,for example,
 454the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this
 455as a stream of primary expressions separated by binary operators.  As such,
 456it will first parse the leading primary expression "a", then it will see the
 457pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses
 458are primary expressions, the binary expression parser doesn't need to worry
 459about nested subexpressions like (c+d) at all. 
 460</p>
 461
 462<p>
 463To start, an expression is a primary expression potentially followed by a
 464sequence of [binop,primaryexpr] pairs:</p>
 465
 466<div class="doc_code">
 467<pre>
 468/// expression
 469///   ::= primary binoprhs
 470///
 471static ExprAST *ParseExpression() {
 472  ExprAST *LHS = ParsePrimary();
 473  if (!LHS) return 0;
 474  
 475  return ParseBinOpRHS(0, LHS);
 476}
 477</pre>
 478</div>
 479
 480<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
 481us.  It takes a precedence and a pointer to an expression for the part that has been
 482parsed so far.   Note that "x" is a perfectly valid expression: As such, "binoprhs" is
 483allowed to be empty, in which case it returns the expression that is passed into
 484it. In our example above, the code passes the expression for "a" into
 485<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
 486
 487<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
 488minimal operator precedence</em> that the function is allowed to eat.  For
 489example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
 490passed in a precedence of 40, it will not consume any tokens (because the
 491precedence of '+' is only 20).  With this in mind, <tt>ParseBinOpRHS</tt> starts
 492with:</p>
 493
 494<div class="doc_code">
 495<pre>
 496/// binoprhs
 497///   ::= ('+' primary)*
 498static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
 499  // If this is a binop, find its precedence.
 500  while (1) {
 501    int TokPrec = GetTokPrecedence();
 502    
 503    // If this is a binop that binds at least as tightly as the current binop,
 504    // consume it, otherwise we are done.
 505    if (TokPrec &lt; ExprPrec)
 506      return LHS;
 507</pre>
 508</div>
 509
 510<p>This code gets the precedence of the current token and checks to see if if is
 511too low.  Because we defined invalid tokens to have a precedence of -1, this 
 512check implicitly knows that the pair-stream ends when the token stream runs out
 513of binary operators.  If this check succeeds, we know that the token is a binary
 514operator and that it will be included in this expression:</p>
 515
 516<div class="doc_code">
 517<pre>
 518    // Okay, we know this is a binop.
 519    int BinOp = CurTok;
 520    getNextToken();  // eat binop
 521    
 522    // Parse the primary expression after the binary operator.
 523    ExprAST *RHS = ParsePrimary();
 524    if (!RHS) return 0;
 525</pre>
 526</div>
 527
 528<p>As such, this code eats (and remembers) the binary operator and then parses
 529the primary expression that follows.  This builds up the whole pair, the first of
 530which is [+, b] for the running example.</p>
 531
 532<p>Now that we parsed the left-hand side of an expression and one pair of the 
 533RHS sequence, we have to decide which way the expression associates.  In
 534particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)".
 535To determine this, we look ahead at "binop" to determine its precedence and 
 536compare it to BinOp's precedence (which is '+' in this case):</p>
 537
 538<div class="doc_code">
 539<pre>
 540    // If BinOp binds less tightly with RHS than the operator after RHS, let
 541    // the pending operator take RHS as its LHS.
 542    int NextPrec = GetTokPrecedence();
 543    if (TokPrec &lt; NextPrec) {
 544</pre>
 545</div>
 546
 547<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
 548precedence of our current operator, then we know that the parentheses associate
 549as "(a+b) binop ...".  In our example, the current operator is "+" and the next 
 550operator is "+", we know that they have the same precedence.  In this case we'll
 551create the AST node for "a+b", and then continue parsing:</p>
 552
 553<div class="doc_code">
 554<pre>
 555      ... if body omitted ...
 556    }
 557    
 558    // Merge LHS/RHS.
 559    LHS = new BinaryExprAST(BinOp, LHS, RHS);
 560  }  // loop around to the top of the while loop.
 561}
 562</pre>
 563</div>
 564
 565<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
 566iteration of the loop, with "+" as the current token.  The code above will eat, 
 567remember, and parse "(c+d)" as the primary expression, which makes the
 568current pair equal to [+, (c+d)].  It will then evaluate the 'if' conditional above with 
 569"*" as the binop to the right of the primary.  In this case, the precedence of "*" is
 570higher than the precedence of "+" so the if condition will be entered.</p>
 571
 572<p>The critical question left here is "how can the if condition parse the right
 573hand side in full"?  In particular, to build the AST correctly for our example,
 574it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to
 575do this is surprisingly simple (code from the above two blocks duplicated for
 576context):</p>
 577
 578<div class="doc_code">
 579<pre>
 580    // If BinOp binds less tightly with RHS than the operator after RHS, let
 581    // the pending operator take RHS as its LHS.
 582    int NextPrec = GetTokPrecedence();
 583    if (TokPrec &lt; NextPrec) {
 584      <b>RHS = ParseBinOpRHS(TokPrec+1, RHS);
 585      if (RHS == 0) return 0;</b>
 586    }
 587    // Merge LHS/RHS.
 588    LHS = new BinaryExprAST(BinOp, LHS, RHS);
 589  }  // loop around to the top of the while loop.
 590}
 591</pre>
 592</div>
 593
 594<p>At this point, we know that the binary operator to the RHS of our primary
 595has higher precedence than the binop we are currently parsing.  As such, we know
 596that any sequence of pairs whose operators are all higher precedence than "+"
 597should be parsed together and returned as "RHS".  To do this, we recursively
 598invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
 599precedence required for it to continue.  In our example above, this will cause
 600it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
 601of the '+' expression.</p>
 602
 603<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
 604and added to the AST.  With this little bit of code (14 non-trivial lines), we
 605correctly handle fully general binary expression parsing in a very elegant way.
 606This was a whirlwind tour of this code, and it is somewhat subtle.  I recommend
 607running through it with a few tough examples to see how it works.
 608</p>
 609
 610<p>This wraps up handling of expressions.  At this point, we can point the
 611parser at an arbitrary token stream and build an expression from it, stopping
 612at the first token that is not part of the expression.  Next up we need to
 613handle function definitions, etc.</p>
 614
 615</div>
 616
 617<!-- *********************************************************************** -->
 618<h2><a name="parsertop">Parsing the Rest</a></h2>
 619<!-- *********************************************************************** -->
 620
 621<div>
 622
 623<p>
 624The next thing missing is handling of function prototypes.  In Kaleidoscope,
 625these are used both for 'extern' function declarations as well as function body
 626definitions.  The code to do this is straight-forward and not very interesting
 627(once you've survived expressions):
 628</p>
 629
 630<div class="doc_code">
 631<pre>
 632/// prototype
 633///   ::= id '(' id* ')'
 634static PrototypeAST *ParsePrototype() {
 635  if (CurTok != tok_identifier)
 636    return ErrorP("Expected function name in prototype");
 637
 638  std::string FnName = IdentifierStr;
 639  getNextToken();
 640  
 641  if (CurTok != '(')
 642    return ErrorP("Expected '(' in prototype");
 643  
 644  // Read the list of argument names.
 645  std::vector&lt;std::string&gt; ArgNames;
 646  while (getNextToken() == tok_identifier)
 647    ArgNames.push_back(IdentifierStr);
 648  if (CurTok != ')')
 649    return ErrorP("Expected ')' in prototype");
 650  
 651  // success.
 652  getNextToken();  // eat ')'.
 653  
 654  return new PrototypeAST(FnName, ArgNames);
 655}
 656</pre>
 657</div>
 658
 659<p>Given this, a function definition is very simple, just a prototype plus
 660an expression to implement the body:</p>
 661
 662<div class="doc_code">
 663<pre>
 664/// definition ::= 'def' prototype expression
 665static FunctionAST *ParseDefinition() {
 666  getNextToken();  // eat def.
 667  PrototypeAST *Proto = ParsePrototype();
 668  if (Proto == 0) return 0;
 669
 670  if (ExprAST *E = ParseExpression())
 671    return new FunctionAST(Proto, E);
 672  return 0;
 673}
 674</pre>
 675</div>
 676
 677<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
 678well as to support forward declaration of user functions.  These 'extern's are just
 679prototypes with no body:</p>
 680
 681<div class="doc_code">
 682<pre>
 683/// external ::= 'extern' prototype
 684static PrototypeAST *ParseExtern() {
 685  getNextToken();  // eat extern.
 686  return ParsePrototype();
 687}
 688</pre>
 689</div>
 690
 691<p>Finally, we'll also let the user type in arbitrary top-level expressions and
 692evaluate them on the fly.  We will handle this by defining anonymous nullary
 693(zero argument) functions for them:</p>
 694
 695<div class="doc_code">
 696<pre>
 697/// toplevelexpr ::= expression
 698static FunctionAST *ParseTopLevelExpr() {
 699  if (ExprAST *E = ParseExpression()) {
 700    // Make an anonymous proto.
 701    PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
 702    return new FunctionAST(Proto, E);
 703  }
 704  return 0;
 705}
 706</pre>
 707</div>
 708
 709<p>Now that we have all the pieces, let's build a little driver that will let us
 710actually <em>execute</em> this code we've built!</p>
 711
 712</div>
 713
 714<!-- *********************************************************************** -->
 715<h2><a name="driver">The Driver</a></h2>
 716<!-- *********************************************************************** -->
 717
 718<div>
 719
 720<p>The driver for this simply invokes all of the parsing pieces with a top-level
 721dispatch loop.  There isn't much interesting here, so I'll just include the
 722top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level
 723Parsing" section.</p>
 724
 725<div class="doc_code">
 726<pre>
 727/// top ::= definition | external | expression | ';'
 728static void MainLoop() {
 729  while (1) {
 730    fprintf(stderr, "ready&gt; ");
 731    switch (CurTok) {
 732    case tok_eof:    return;
 733    case ';':        getNextToken(); break;  // ignore top-level semicolons.
 734    case tok_def:    HandleDefinition(); break;
 735    case tok_extern: HandleExtern(); break;
 736    default:         HandleTopLevelExpression(); break;
 737    }
 738  }
 739}
 740</pre>
 741</div>
 742
 743<p>The most interesting part of this is that we ignore top-level semicolons.
 744Why is this, you ask?  The basic reason is that if you type "4 + 5" at the
 745command line, the parser doesn't know whether that is the end of what you will type
 746or not.  For example, on the next line you could type "def foo..." in which case
 7474+5 is the end of a top-level expression.  Alternatively you could type "* 6",
 748which would continue the expression.  Having top-level semicolons allows you to
 749type "4+5;", and the parser will know you are done.</p> 
 750
 751</div>
 752
 753<!-- *********************************************************************** -->
 754<h2><a name="conclusions">Conclusions</a></h2>
 755<!-- *********************************************************************** -->
 756
 757<div>
 758
 759<p>With just under 400 lines of commented code (240 lines of non-comment, 
 760non-blank code), we fully defined our minimal language, including a lexer,
 761parser, and AST builder.  With this done, the executable will validate 
 762Kaleidoscope code and tell us if it is grammatically invalid.  For
 763example, here is a sample interaction:</p>
 764
 765<div class="doc_code">
 766<pre>
 767$ <b>./a.out</b>
 768ready&gt; <b>def foo(x y) x+foo(y, 4.0);</b>
 769Parsed a function definition.
 770ready&gt; <b>def foo(x y) x+y y;</b>
 771Parsed a function definition.
 772Parsed a top-level expr
 773ready&gt; <b>def foo(x y) x+y );</b>
 774Parsed a function definition.
 775Error: unknown token when expecting an expression
 776ready&gt; <b>extern sin(a);</b>
 777ready&gt; Parsed an extern
 778ready&gt; <b>^D</b>
 779$ 
 780</pre>
 781</div>
 782
 783<p>There is a lot of room for extension here.  You can define new AST nodes,
 784extend the language in many ways, etc.  In the <a href="LangImpl3.html">next
 785installment</a>, we will describe how to generate LLVM Intermediate
 786Representation (IR) from the AST.</p>
 787
 788</div>
 789
 790<!-- *********************************************************************** -->
 791<h2><a name="code">Full Code Listing</a></h2>
 792<!-- *********************************************************************** -->
 793
 794<div>
 795
 796<p>
 797Here is the complete code listing for this and the previous chapter.  
 798Note that it is fully self-contained: you don't need LLVM or any external
 799libraries at all for this.  (Besides the C and C++ standard libraries, of
 800course.)  To build this, just compile with:</p>
 801
 802<div class="doc_code">
 803<pre>
 804# Compile
 805clang++ -g -O3 toy.cpp
 806# Run
 807./a.out 
 808</pre>
 809</div>
 810
 811<p>Here is the code:</p>
 812
 813<div class="doc_code">
 814<pre>
 815#include &lt;cstdio&gt;
 816#include &lt;cstdlib&gt;
 817#include &lt;string&gt;
 818#include &lt;map&gt;
 819#include &lt;vector&gt;
 820
 821//===----------------------------------------------------------------------===//
 822// Lexer
 823//===----------------------------------------------------------------------===//
 824
 825// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
 826// of these for known things.
 827enum Token {
 828  tok_eof = -1,
 829
 830  // commands
 831  tok_def = -2, tok_extern = -3,
 832
 833  // primary
 834  tok_identifier = -4, tok_number = -5
 835};
 836
 837static std::string IdentifierStr;  // Filled in if tok_identifier
 838static double NumVal;              // Filled in if tok_number
 839
 840/// gettok - Return the next token from standard input.
 841static int gettok() {
 842  static int LastChar = ' ';
 843
 844  // Skip any whitespace.
 845  while (isspace(LastChar))
 846    LastChar = getchar();
 847
 848  if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
 849    IdentifierStr = LastChar;
 850    while (isalnum((LastChar = getchar())))
 851      IdentifierStr += LastChar;
 852
 853    if (IdentifierStr == "def") return tok_def;
 854    if (IdentifierStr == "extern") return tok_extern;
 855    return tok_identifier;
 856  }
 857
 858  if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
 859    std::string NumStr;
 860    do {
 861      NumStr += LastChar;
 862      LastChar = getchar();
 863    } while (isdigit(LastChar) || LastChar == '.');
 864
 865    NumVal = strtod(NumStr.c_str(), 0);
 866    return tok_number;
 867  }
 868
 869  if (LastChar == '#') {
 870    // Comment until end of line.
 871    do LastChar = getchar();
 872    while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp;&amp; LastChar != '\r');
 873    
 874    if (LastChar != EOF)
 875      return gettok();
 876  }
 877  
 878  // Check for end of file.  Don't eat the EOF.
 879  if (LastChar == EOF)
 880    return tok_eof;
 881
 882  // Otherwise, just return the character as its ascii value.
 883  int ThisChar = LastChar;
 884  LastChar = getchar();
 885  return ThisChar;
 886}
 887
 888//===----------------------------------------------------------------------===//
 889// Abstract Syntax Tree (aka Parse Tree)
 890//===----------------------------------------------------------------------===//
 891
 892/// ExprAST - Base class for all expression nodes.
 893class ExprAST {
 894public:
 895  virtual ~ExprAST() {}
 896};
 897
 898/// NumberExprAST - Expression class for numeric literals like "1.0".
 899class NumberExprAST : public ExprAST {
 900  double Val;
 901public:
 902  NumberExprAST(double val) : Val(val) {}
 903};
 904
 905/// VariableExprAST - Expression class for referencing a variable, like "a".
 906class VariableExprAST : public ExprAST {
 907  std::string Name;
 908public:
 909  VariableExprAST(const std::string &amp;name) : Name(name) {}
 910};
 911
 912/// BinaryExprAST - Expression class for a binary operator.
 913class BinaryExprAST : public ExprAST {
 914  char Op;
 915  ExprAST *LHS, *RHS;
 916public:
 917  BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) 
 918    : Op(op), LHS(lhs), RHS(rhs) {}
 919};
 920
 921/// CallExprAST - Expression class for function calls.
 922class CallExprAST : public ExprAST {
 923  std::string Callee;
 924  std::vector&lt;ExprAST*&gt; Args;
 925public:
 926  CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
 927    : Callee(callee), Args(args) {}
 928};
 929
 930/// PrototypeAST - This class represents the "prototype" for a function,
 931/// which captures its name, and its argument names (thus implicitly the number
 932/// of arguments the function takes).
 933class PrototypeAST {
 934  std::string Name;
 935  std::vector&lt;std::string&gt; Args;
 936public:
 937  PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
 938    : Name(name), Args(args) {}
 939  
 940};
 941
 942/// FunctionAST - This class represents a function definition itself.
 943class FunctionAST {
 944  PrototypeAST *Proto;
 945  ExprAST *Body;
 946public:
 947  FunctionAST(PrototypeAST *proto, ExprAST *body)
 948    : Proto(proto), Body(body) {}
 949  
 950};
 951
 952//===----------------------------------------------------------------------===//
 953// Parser
 954//===----------------------------------------------------------------------===//
 955
 956/// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
 957/// token the parser is looking at.  getNextToken reads another token from the
 958/// lexer and updates CurTok with its results.
 959static int CurTok;
 960static int getNextToken() {
 961  return CurTok = gettok();
 962}
 963
 964/// BinopPrecedence - This holds the precedence for each binary operator that is
 965/// defined.
 966static std::map&lt;char, int&gt; BinopPrecedence;
 967
 968/// GetTokPrecedence - Get the precedence of the pending binary operator token.
 969static int GetTokPrecedence() {
 970  if (!isascii(CurTok))
 971    return -1;
 972  
 973  // Make sure it's a declared binop.
 974  int TokPrec = BinopPrecedence[CurTok];
 975  if (TokPrec &lt;= 0) return -1;
 976  return TokPrec;
 977}
 978
 979/// Error* - These are little helper functions for error handling.
 980ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
 981PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
 982FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
 983
 984static ExprAST *ParseExpression();
 985
 986/// identifierexpr
 987///   ::= identifier
 988///   ::= identifier '(' expression* ')'
 989static ExprAST *ParseIdentifierExpr() {
 990  std::string IdName = IdentifierStr;
 991  
 992  getNextToken();  // eat identifier.
 993  
 994  if (CurTok != '(') // Simple variable ref.
 995    return new VariableExprAST(IdName);
 996  
 997  // Call.
 998  getNextToken();  // eat (
 999  std::vector&lt;ExprAST*&gt; Args;
1000  if (CurTok != ')') {
1001    while (1) {
1002      ExprAST *Arg = ParseExpression();
1003      if (!Arg) return 0;
1004      Args.push_back(Arg);
1005
1006      if (CurTok == ')') break;
1007
1008      if (CurTok != ',')
1009        return Error("Expected ')' or ',' in argument list");
1010      getNextToken();
1011    }
1012  }
1013
1014  // Eat the ')'.
1015  getNextToken();
1016  
1017  return new CallExprAST(IdName, Args);
1018}
1019
1020/// numberexpr ::= number
1021static ExprAST *ParseNumberExpr() {
1022  ExprAST *Result = new NumberExprAST(NumVal);
1023  getNextToken(); // consume the number
1024  return Result;
1025}
1026
1027/// parenexpr ::= '(' expression ')'
1028static ExprAST *ParseParenExpr() {
1029  getNextToken();  // eat (.
1030  ExprAST *V = ParseExpression();
1031  if (!V) return 0;
1032  
1033  if (CurTok != ')')
1034    return Error("expected ')'");
1035  getNextToken();  // eat ).
1036  return V;
1037}
1038
1039/// primary
1040///   ::= identifierexpr
1041///   ::= numberexpr
1042///   ::= parenexpr
1043static ExprAST *ParsePrimary() {
1044  switch (CurTok) {
1045  default: return Error("unknown token when expecting an expression");
1046  case tok_identifier: return ParseIdentifierExpr();
1047  case tok_number:     return ParseNumberExpr();
1048  case '(':            return ParseParenExpr();
1049  }
1050}
1051
1052/// binoprhs
1053///   ::= ('+' primary)*
1054static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1055  // If this is a binop, find its precedence.
1056  while (1) {
1057    int TokPrec = GetTokPrecedence();
1058    
1059    // If this is a binop that binds at least as tightly as the current binop,
1060    // consume it, otherwise we are done.
1061    if (TokPrec &lt; ExprPrec)
1062      return LHS;
1063    
1064    // Okay, we know this is a binop.
1065    int BinOp = CurTok;
1066    getNextToken();  // eat binop
1067    
1068    // Parse the primary expression after the binary operator.
1069    ExprAST *RHS = ParsePrimary();
1070    if (!RHS) return 0;
1071    
1072    // If BinOp binds less tightly with RHS than the operator after RHS, let
1073    // the pending operator take RHS as its LHS.
1074    int NextPrec = GetTokPrecedence();
1075    if (TokPrec &lt; NextPrec) {
1076      RHS = ParseBinOpRHS(TokPrec+1, RHS);
1077      if (RHS == 0) return 0;
1078    }
1079    
1080    // Merge LHS/RHS.
1081    LHS = new BinaryExprAST(BinOp, LHS, RHS);
1082  }
1083}
1084
1085/// expression
1086///   ::= primary binoprhs
1087///
1088static ExprAST *ParseExpression() {
1089  ExprAST *LHS = ParsePrimary();
1090  if (!LHS) return 0;
1091  
1092  return ParseBinOpRHS(0, LHS);
1093}
1094
1095/// prototype
1096///   ::= id '(' id* ')'
1097static PrototypeAST *ParsePrototype() {
1098  if (CurTok != tok_identifier)
1099    return ErrorP("Expected function name in prototype");
1100
1101  std::string FnName = IdentifierStr;
1102  getNextToken();
1103  
1104  if (CurTok != '(')
1105    return ErrorP("Expected '(' in prototype");
1106  
1107  std::vector&lt;std::string&gt; ArgNames;
1108  while (getNextToken() == tok_identifier)
1109    ArgNames.push_back(IdentifierStr);
1110  if (CurTok != ')')
1111    return ErrorP("Expected ')' in prototype");
1112  
1113  // success.
1114  getNextToken();  // eat ')'.
1115  
1116  return new PrototypeAST(FnName, ArgNames);
1117}
1118
1119/// definition ::= 'def' prototype expression
1120static FunctionAST *ParseDefinition() {
1121  getNextToken();  // eat def.
1122  PrototypeAST *Proto = ParsePrototype();
1123  if (Proto == 0) return 0;
1124
1125  if (ExprAST *E = ParseExpression())
1126    return new FunctionAST(Proto, E);
1127  return 0;
1128}
1129
1130/// toplevelexpr ::= expression
1131static FunctionAST *ParseTopLevelExpr() {
1132  if (ExprAST *E = ParseExpression()) {
1133    // Make an anonymous proto.
1134    PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
1135    return new FunctionAST(Proto, E);
1136  }
1137  return 0;
1138}
1139
1140/// external ::= 'extern' prototype
1141static PrototypeAST *ParseExtern() {
1142  getNextToken();  // eat extern.
1143  return ParsePrototype();
1144}
1145
1146//===----------------------------------------------------------------------===//
1147// Top-Level parsing
1148//===----------------------------------------------------------------------===//
1149
1150static void HandleDefinition() {
1151  if (ParseDefinition()) {
1152    fprintf(stderr, "Parsed a function definition.\n");
1153  } else {
1154    // Skip token for error recovery.
1155    getNextToken();
1156  }
1157}
1158
1159static void HandleExtern() {
1160  if (ParseExtern()) {
1161    fprintf(stderr, "Parsed an extern\n");
1162  } else {
1163    // Skip token for error recovery.
1164    getNextToken();
1165  }
1166}
1167
1168static void HandleTopLevelExpression() {
1169  // Evaluate a top-level expression into an anonymous function.
1170  if (ParseTopLevelExpr()) {
1171    fprintf(stderr, "Parsed a top-level expr\n");
1172  } else {
1173    // Skip token for error recovery.
1174    getNextToken();
1175  }
1176}
1177
1178/// top ::= definition | external | expression | ';'
1179static void MainLoop() {
1180  while (1) {
1181    fprintf(stderr, "ready&gt; ");
1182    switch (CurTok) {
1183    case tok_eof:    return;
1184    case ';':        getNextToken(); break;  // ignore top-level semicolons.
1185    case tok_def:    HandleDefinition(); break;
1186    case tok_extern: HandleExtern(); break;
1187    default:         HandleTopLevelExpression(); break;
1188    }
1189  }
1190}
1191
1192//===----------------------------------------------------------------------===//
1193// Main driver code.
1194//===----------------------------------------------------------------------===//
1195
1196int main() {
1197  // Install standard binary operators.
1198  // 1 is lowest precedence.
1199  BinopPrecedence['&lt;'] = 10;
1200  BinopPrecedence['+'] = 20;
1201  BinopPrecedence['-'] = 20;
1202  BinopPrecedence['*'] = 40;  // highest.
1203
1204  // Prime the first token.
1205  fprintf(stderr, "ready&gt; ");
1206  getNextToken();
1207
1208  // Run the main "interpreter loop" now.
1209  MainLoop();
1210
1211  return 0;
1212}
1213</pre>
1214</div>
1215<a href="LangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
1216</div>
1217
1218<!-- *********************************************************************** -->
1219<hr>
1220<address>
1221  <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1222  src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1223  <a href="http://validator.w3.org/check/referer"><img
1224  src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
1225
1226  <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1227  <a href="http://llvm.org/">The LLVM Compiler Infrastructure</a><br>
1228  Last modified: $Date: 2012-05-03 06:46:36 +0800 (周四, 03 δΊ”ζœˆ 2012) $
1229</address>
1230</body>
1231</html>