- <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <!-- Copyright (C) 1987-2015 Free Software Foundation, Inc.
- Permission is granted to copy, distribute and/or modify this document
- under the terms of the GNU Free Documentation License, Version 1.3 or
- any later version published by the Free Software Foundation. A copy of
- the license is included in the
- section entitled "GNU Free Documentation License".
- This manual contains no Invariant Sections. The Front-Cover Texts are
- (a) (see below), and the Back-Cover Texts are (b) (see below).
- (a) The FSF's Front-Cover Text is:
- A GNU Manual
- (b) The FSF's Back-Cover Text is:
- You have freedom to copy and modify this GNU Manual, like GNU
- software. Copies published by the Free Software Foundation raise
- funds for GNU development. -->
- <!-- Created by GNU Texinfo 5.2, http://www.gnu.org/software/texinfo/ -->
- <head>
- <title>The C Preprocessor: Tokenization</title>
- <meta name="description" content="The C Preprocessor: Tokenization">
- <meta name="keywords" content="The C Preprocessor: Tokenization">
- <meta name="resource-type" content="document">
- <meta name="distribution" content="global">
- <meta name="Generator" content="makeinfo">
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- <link href="index.html#Top" rel="start" title="Top">
- <link href="Index-of-Directives.html#Index-of-Directives" rel="index" title="Index of Directives">
- <link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
- <link href="Overview.html#Overview" rel="up" title="Overview">
- <link href="The-preprocessing-language.html#The-preprocessing-language" rel="next" title="The preprocessing language">
- <link href="Initial-processing.html#Initial-processing" rel="prev" title="Initial processing">
- <style type="text/css">
- <!--
- a.summary-letter {text-decoration: none}
- blockquote.smallquotation {font-size: smaller}
- div.display {margin-left: 3.2em}
- div.example {margin-left: 3.2em}
- div.indentedblock {margin-left: 3.2em}
- div.lisp {margin-left: 3.2em}
- div.smalldisplay {margin-left: 3.2em}
- div.smallexample {margin-left: 3.2em}
- div.smallindentedblock {margin-left: 3.2em; font-size: smaller}
- div.smalllisp {margin-left: 3.2em}
- kbd {font-style:oblique}
- pre.display {font-family: inherit}
- pre.format {font-family: inherit}
- pre.menu-comment {font-family: serif}
- pre.menu-preformatted {font-family: serif}
- pre.smalldisplay {font-family: inherit; font-size: smaller}
- pre.smallexample {font-size: smaller}
- pre.smallformat {font-family: inherit; font-size: smaller}
- pre.smalllisp {font-size: smaller}
- span.nocodebreak {white-space:nowrap}
- span.nolinebreak {white-space:nowrap}
- span.roman {font-family:serif; font-weight:normal}
- span.sansserif {font-family:sans-serif; font-weight:normal}
- ul.no-bullet {list-style: none}
- -->
- </style>
- </head>
- <body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">
- <a name="Tokenization"></a>
- <div class="header">
- <p>
- Next: <a href="The-preprocessing-language.html#The-preprocessing-language" accesskey="n" rel="next">The preprocessing language</a>, Previous: <a href="Initial-processing.html#Initial-processing" accesskey="p" rel="prev">Initial processing</a>, Up: <a href="Overview.html#Overview" accesskey="u" rel="up">Overview</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index-of-Directives.html#Index-of-Directives" title="Index" rel="index">Index</a>]</p>
- </div>
- <hr>
- <a name="Tokenization-1"></a>
- <h3 class="section">1.3 Tokenization</h3>
- <a name="index-tokens"></a>
- <a name="index-preprocessing-tokens"></a>
- <p>After the textual transformations are finished, the input file is
- converted into a sequence of <em>preprocessing tokens</em>. These mostly
- correspond to the syntactic tokens used by the C compiler, but there are
- a few differences. White space separates tokens; it is not itself a
- token of any kind. Tokens do not have to be separated by white space,
- but white space is often necessary to avoid ambiguities.
- </p>
- <p>When faced with a sequence of characters that has more than one possible
- tokenization, the preprocessor is greedy. It always makes each token,
- starting from the left, as big as possible before moving on to the next
- token. For instance, <code>a+++++b</code> is interpreted as
- <code>a ++ ++ + b<!-- /@w --></code>, not as <code>a ++ + ++ b<!-- /@w --></code>, even though the
- latter tokenization could be part of a valid C program and the former
- could not.
- </p>
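- <p>The following is a minimal sketch of how the greedy rule affects real
- code. The function name <code>greedy_demo</code> is purely illustrative.
- </p>

```c
/* Maximal munch: "a+++++b" lexes as  a ++ ++ + b  and is rejected,
   so white space is needed to get the tokenization we want. */
int greedy_demo(void)
{
    int a = 1, b = 2;
    /* int bad = a+++++b;   would not compile: (a++)++ needs an lvalue */
    int ok = a++ + ++b;     /* tokens: a ++ + ++ b, that is 1 + 3 */
    return ok;
}
```

- <p>With the white space in place, the call returns 4.
- </p>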
- <p>Once the input file is broken into tokens, the token boundaries never
- change, except when the ‘<samp>##</samp>’ preprocessing operator is used to paste
- tokens together. See <a href="Concatenation.html#Concatenation">Concatenation</a>. For example,
- </p>
- <div class="smallexample">
- <pre class="smallexample">#define foo() bar
- foo()baz
- → bar baz
- <em>not</em>
- → barbaz
- </pre></div>
- <p>The compiler does not re-tokenize the preprocessor’s output. Each
- preprocessing token becomes one compiler token.
- </p>
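- <p>One visible consequence, sketched here with a hypothetical macro
- <code>NEG</code> and function <code>boundary_demo</code>: a minus sign that
- comes from a macro expansion never fuses with a neighboring minus sign
- to form a decrement operator.
- </p>

```c
/* NEG expands to a single '-' token (the name is illustrative). */
#define NEG -

int boundary_demo(void)
{
    int x = 5;
    /* Expands to the two tokens  - -  followed by x, never the single
       token "--", so this is a double negation rather than a decrement. */
    int y = -NEG x;
    return y;   /* 5, and x is unchanged */
}
```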
- <a name="index-identifiers"></a>
- <p>Preprocessing tokens fall into five broad classes: identifiers,
- preprocessing numbers, string literals, punctuators, and other. An
- <em>identifier</em> is the same as an identifier in C: any sequence of
- letters, digits, or underscores, which begins with a letter or
- underscore. Keywords of C have no significance to the preprocessor;
- they are ordinary identifiers. You can define a macro whose name is a
- keyword, for instance. The only identifier which can be considered a
- preprocessing keyword is <code>defined</code>. See <a href="Defined.html#Defined">Defined</a>.
- </p>
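- <p>As a rough illustration, the sketch below defines a macro named after
- a keyword and then removes it again, and uses <code>defined</code> in a
- conditional. All macro names other than <code>while</code>, and the function
- <code>keyword_demo</code>, are invented for this example.
- </p>

```c
#define while 42      /* accepted: to the preprocessor "while" is an identifier */
#ifdef while
#define KEYWORD_MACRO_SEEN 1
#else
#define KEYWORD_MACRO_SEEN 0
#endif
#undef while          /* restore the keyword for the rest of the file */

/* "defined" is the one identifier with a special meaning in #if. */
#if defined(KEYWORD_MACRO_SEEN) && !defined(NO_SUCH_MACRO)
#define DEFINED_WORKS 1
#else
#define DEFINED_WORKS 0
#endif

int keyword_demo(void)
{
    return KEYWORD_MACRO_SEEN * 10 + DEFINED_WORKS;
}
```

- <p>Redefining keywords like this is rarely a good idea in real code; the
- macro is removed immediately so that the rest of the file is unaffected.
- </p>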
- <p>The same is mostly true of other languages which use the C preprocessor.
- However, a few of the keywords of C++ are significant even in the
- preprocessor. See <a href="C_002b_002b-Named-Operators.html#C_002b_002b-Named-Operators">C++ Named Operators</a>.
- </p>
- <p>In the 1999 C standard, identifiers may contain letters which are not
- part of the “basic source character set”, at the implementation’s
- discretion (such as accented Latin letters, Greek letters, or Chinese
- ideograms). This may be done with an extended character set, or the
- ‘<samp>\u</samp>’ and ‘<samp>\U</samp>’ escape sequences. The implementation of this
- feature in GCC is experimental; such characters are only accepted in
- the ‘<samp>\u</samp>’ and ‘<samp>\U</samp>’ forms and only if
- <samp>-fextended-identifiers</samp> is used.
- </p>
- <p>As an extension, GCC treats ‘<samp>$</samp>’ as a letter. This is for
- compatibility with some systems, such as VMS, where ‘<samp>$</samp>’ is commonly
- used in system-defined function and object names. ‘<samp>$</samp>’ is not a
- letter in strictly conforming mode, or if you specify the <samp>-$</samp>
- option. See <a href="Invocation.html#Invocation">Invocation</a>.
- </p>
- <a name="index-numbers"></a>
- <a name="index-preprocessing-numbers"></a>
- <p>A <em>preprocessing number</em> has a rather bizarre definition. The
- category includes all the normal integer and floating point constants
- one expects of C, but also a number of other things one might not
- initially recognize as a number. Formally, preprocessing numbers begin
- with an optional period, a required decimal digit, and then continue
- with any sequence of letters, digits, underscores, periods, and
- exponents. Exponents are the two-character sequences ‘<samp>e+</samp>’,
- ‘<samp>e-</samp>’, ‘<samp>E+</samp>’, ‘<samp>E-</samp>’, ‘<samp>p+</samp>’, ‘<samp>p-</samp>’, ‘<samp>P+</samp>’, and
- ‘<samp>P-</samp>’. (The exponents that begin with ‘<samp>p</samp>’ or ‘<samp>P</samp>’ are new
- to C99. They are used for hexadecimal floating-point constants.)
- </p>
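- <p>The definition can be sketched as a small recognizer. This is only an
- illustration of the rule as stated above, not the code GCC uses; the
- function name <code>is_pp_number</code> is hypothetical, and the sketch
- ignores extensions such as ‘<samp>$</samp>’ and extended characters.
- </p>

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the whole string s is a single preprocessing number. */
int is_pp_number(const char *s)
{
    size_t i = 0;

    if (s[i] == '.')                        /* optional leading period */
        i++;
    if (!isdigit((unsigned char)s[i]))      /* required decimal digit */
        return 0;
    i++;

    while (s[i] != '\0') {
        unsigned char c = (unsigned char)s[i];
        char prev = s[i - 1];

        if (isalnum(c) || c == '_' || c == '.') {
            i++;                            /* letter, digit, underscore, period */
        } else if ((c == '+' || c == '-') &&
                   (prev == 'e' || prev == 'E' || prev == 'p' || prev == 'P')) {
            i++;                            /* sign that completes an exponent */
        } else {
            return 0;                       /* anything else ends the token */
        }
    }
    return 1;
}
```

- <p>Under this sketch, <code>1.2.3</code> and <code>0xE+12</code> are each a
- single preprocessing number, while <code>12+3</code> is not.
- </p>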
- <p>The purpose of this unusual definition is to isolate the preprocessor
- from the full complexity of numeric constants. It does not have to
- distinguish between lexically valid and invalid floating-point numbers,
- which is complicated. The definition also permits you to split an
- identifier at any position and get exactly two tokens, which can then be
- pasted back together with the ‘<samp>##</samp>’ operator.
- </p>
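- <p>For instance, the identifier <code>x1y2</code> can be split into the
- identifier <code>x</code> and the preprocessing number <code>1y2</code>.
- The sketch below pastes them back together; <code>CAT</code> and
- <code>paste_demo</code> are names made up for the example.
- </p>

```c
/* CAT is a hypothetical helper: paste two tokens into one. */
#define CAT(a, b) a##b

int paste_demo(void)
{
    int x1y2 = 7;
    /* "x" is an identifier and "1y2" is a preprocessing number;
       pasting them gives back the identifier x1y2. */
    return CAT(x, 1y2);
}
```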
- <p>It’s possible for preprocessing numbers to cause programs to be
- misinterpreted. For example, <code>0xE+12</code> is a single preprocessing number
- that does not translate to any valid numeric constant, so it is a
- syntax error. It does not mean <code>0xE + 12<!-- /@w --></code>, which is what you
- might have intended.
- </p>
- <a name="index-string-literals"></a>
- <a name="index-string-constants"></a>
- <a name="index-character-constants"></a>
- <a name="index-header-file-names"></a>
- <p><em>String literals</em> are string constants, character constants, and
- header file names (the argument of ‘<samp>#include</samp>’).<a name="DOCF2" href="#FOOT2"><sup>2</sup></a> String constants and character
- constants are straightforward: <tt>"…"</tt> or <tt>'…'</tt>. In
- either case embedded quotes should be escaped with a backslash:
- <tt>'\''</tt> is the character constant for ‘<samp>'</samp>’. There is no limit on
- the length of a character constant, but the value of a character
- constant that contains more than one character is
- implementation-defined. See <a href="Implementation-Details.html#Implementation-Details">Implementation Details</a>.
- </p>
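- <p>A short sketch of escaped quotes in both kinds of constant. The
- function name <code>quote_demo</code> is illustrative, and the comparison
- against 39 assumes an ASCII execution character set.
- </p>

```c
#include <string.h>

/* Escaped quotes inside character and string constants. */
int quote_demo(void)
{
    char apostrophe = '\'';                 /* the single character ' */
    const char *quoted = "say \"hi\"";      /* say "hi" : 8 characters */
    return apostrophe == 39 && strlen(quoted) == 8;
}
```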
- <p>Header file names either look like string constants, <tt>"…"</tt>, or are
- written with angle brackets instead, <tt>&lt;…&gt;</tt>. In either case,
- backslash is an ordinary character. There is no way to escape the
- closing quote or angle bracket. The preprocessor looks for the header
- file in different places depending on which form you use. See <a href="Include-Operation.html#Include-Operation">Include Operation</a>.
- </p>
- <p>No string literal may extend past the end of a line. Older versions
- of GCC accepted multi-line string constants. You may use continued
- lines instead, or string constant concatenation. See <a href="Differences-from-previous-versions.html#Differences-from-previous-versions">Differences from previous versions</a>.
- </p>
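- <p>Both alternatives are sketched below and produce the same string. The
- variable and function names are invented for the example. Note that the
- continued line must start in the first column, because any leading white
- space would become part of the string.
- </p>

```c
#include <string.h>

/* Backslash-newline splices the two physical lines inside one literal. */
static const char continued[] = "first half, \
second half";

/* Adjacent string constants are concatenated by the compiler. */
static const char concatenated[] = "first half, "
                                   "second half";

int same_string(void)
{
    return strcmp(continued, concatenated) == 0;
}
```

- <p>String constant concatenation is usually the easier form to keep
- readable, since it does not constrain indentation.
- </p>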
- <a name="index-punctuators"></a>
- <a name="index-digraphs"></a>
- <a name="index-alternative-tokens"></a>
- <p><em>Punctuators</em> are all the usual bits of punctuation which are
- meaningful to C and C++. All but three of the punctuation characters in
- ASCII are C punctuators. The exceptions are ‘<samp>@</samp>’, ‘<samp>$</samp>’, and
- ‘<samp>`</samp>’. In addition, all the two- and three-character operators are
- punctuators. There are also six <em>digraphs</em>, which the C++ standard
- calls <em>alternative tokens</em>; these are merely alternate ways to spell
- other punctuators. Digraphs are a second attempt to work around missing
- punctuation in obsolete systems. Unlike trigraphs, they have no negative
- side effects, but they do not cover as much ground. The digraphs and
- their corresponding normal punctuators are:
- </p>
- <div class="smallexample">
- <pre class="smallexample">Digraph:        &lt;%  %&gt;  &lt;:  :&gt;  %:  %:%:
- Punctuator:      {   }   [   ]   #    ##
- </pre></div>
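- <p>As a sketch, here is a small function written with digraphs in place
- of braces, brackets, and the directive marker. It assumes a compiler mode
- that recognizes digraphs (C94 or later, which includes GCC's default
- mode); <code>COUNT</code> and <code>digraph_sum</code> are made-up names.
- </p>

```c
%:define COUNT 3                      /* %: spells # */

int digraph_sum(void)
<%                                    /* <% spells { */
    int v<:COUNT:> = <% 1, 2, 3 %>;   /* <: and :> spell [ and ] */
    return v<:0:> + v<:1:> + v<:2:>;
%>                                    /* %> spells } */
```

- <p>The function is equivalent to one written with the usual punctuators
- and returns 6.
- </p>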
- <a name="index-other-tokens"></a>
- <p>Any other single character is considered “other”. It is passed on to
- the preprocessor’s output unmolested. The C compiler will almost
- certainly reject source code containing “other” tokens. In ASCII, the
- only other characters are ‘<samp>@</samp>’, ‘<samp>$</samp>’, ‘<samp>`</samp>’, and control
- characters other than NUL (all bits zero). (Note that ‘<samp>$</samp>’ is
- normally considered a letter.) All characters with the high bit set
- (numeric range 0x80–0xFF) are also “other” in the present
- implementation. This will change when proper support for international
- character sets is added to GCC.
- </p>
- <p>NUL is a special case because of the high probability that its
- appearance is accidental, and because it may be invisible to the user
- (many terminals do not display NUL at all). Within comments, NULs are
- silently ignored, just as any other character would be. In running
- text, NUL is considered white space. For example, these two directives
- have the same meaning.
- </p>
- <div class="smallexample">
- <pre class="smallexample">#define X^@1
- #define X 1
- </pre></div>
- <p>(where ‘<samp>^@</samp>’ is ASCII NUL). Within string or character constants,
- NULs are preserved. In the latter two cases the preprocessor emits a
- warning message.
- </p>
- <div class="footnote">
- <hr>
- <h4 class="footnotes-heading">Footnotes</h4>
- <h3><a name="FOOT2" href="#DOCF2">(2)</a></h3>
- <p>The C
- standard uses the term <em>string literal</em> to refer only to what we are
- calling <em>string constants</em>.</p>
- </div>
- </body>
- </html>