\input texinfo
@setfilename cppinternals.info
@settitle The GNU C Preprocessor Internals

@include gcc-common.texi

@ifinfo
@dircategory Software development
@direntry
* Cpplib: (cppinternals).      Cpplib internals.
@end direntry
@end ifinfo

@c @smallbook
@c @cropmarks
@c @finalout
@setchapternewpage odd
@ifinfo
This file documents the internals of the GNU C Preprocessor.

Copyright 2000, 2001, 2002, 2004, 2005 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through Tex and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@end ifinfo

@titlepage
@title Cpplib Internals
@versionsubtitle
@author Neil Booth
@page
@vskip 0pt plus 1filll
@c man begin COPYRIGHT
Copyright @copyright{} 2000, 2001, 2002, 2004, 2005
Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@c man end
@end titlepage
@contents
@page

@node Top
@top
@chapter Cpplib---the GNU C Preprocessor

The GNU C preprocessor is
implemented as a library, @dfn{cpplib}, so it can be easily shared between
a stand-alone preprocessor and a preprocessor integrated with the C
and C++ front ends.  It is also available for use by other programs,
though this is not recommended as its exposed interface has not yet
reached a point of reasonable stability.

The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary.  It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

This brief manual documents the internals of cpplib, and explains some
of the tricky issues.  It is intended that, along with the comments in
the source code, a reasonably competent C programmer should be able to
figure out what the code is doing, and why things have been implemented
the way they have.

@menu
* Conventions::         Conventions used in the code.
* Lexer::               The combined C and C++ Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.
@end menu

@node Conventions
@unnumbered Conventions
@cindex interface
@cindex header files

cpplib has two interfaces---one is exposed internally only, and the
other is for both internal and external use.

The convention is that functions and types that are exposed to multiple
files internally are prefixed with @samp{_cpp_}, and are to be found in
the file @file{internal.h}.  Functions and types exposed to external
clients are in @file{cpplib.h}, and prefixed with @samp{cpp_}.  For
historical reasons this is no longer quite true, but we should strive to
stick to it.

We are striving to reduce the information exposed in @file{cpplib.h} to the
bare minimum necessary, and then to keep it there.  This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.

@node Lexer
@unnumbered The Lexer
@cindex lexer
@cindex newlines
@cindex escaped newlines

@section Overview
The lexer is contained in the file @file{lex.c}.  It is a hand-coded
lexer, and is not implemented as a state machine.  It can understand C and
C++ source code, and has been extended to allow reasonably successful
preprocessing of assembly language.  The lexer does not make an initial
pass to strip out trigraphs and escaped newlines, but handles them as
they are encountered in a single pass of the input file.  It returns
preprocessing tokens individually, not a line at a time.

It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, @code{cpp_get_token}, takes care
of lexing new tokens, handling directives, and expanding macros as
necessary.  However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
@code{cpp_spell_token} and @code{cpp_token_len}.  These functions are
useful when generating diagnostics, and for emitting the preprocessed
output.

@section Lexing a token
Lexing of an individual token is handled by @code{_cpp_lex_direct} and
its subroutines.  In its current form the code is quite complicated,
with read-ahead characters and suchlike, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings.  The current plan is to convert any such files to UTF-8
before processing them.  This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of @code{_cpp_lex_direct} is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor r@^ole to play in memory
management of lexed lines.  I discuss these issues in a separate section
(@pxref{Lexing a line}).

The lexer places the token it lexes into storage pointed to by the
variable @code{cur_token}, and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
@code{line} and @code{col} values of the token just before the location
that @code{cur_token} points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
@code{PREV_WHITE} bit in the token's flags.  Each token has its
@code{line} and @code{col} variables set to the line and column of the
first character of the token.  This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.

The first token on a logical, i.e.@: unescaped, line has the flag
@code{BOL} set for beginning-of-line.  This flag is intended for
internal use, both to distinguish a @samp{#} that begins a directive
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it.  Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the @code{BOL} flag set.  The macro expansion may even be empty, and the
next token on the line certainly won't have the @code{BOL} flag set.

New lines are treated specially; exactly how the lexer handles them is
context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
@code{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
which is normally used to indicate end-of-file, to indicate
end-of-directive.  In a directive a @code{CPP_EOF} token never means
end-of-file.  Conveniently, if the caller was @code{collect_args}, it
already handles @code{CPP_EOF} as if it were end-of-file, and reports an
error about an unterminated macro argument list.

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This whitespace is
important in case the macro argument is stringified.  The state variable
@code{parsing_args} is nonzero when the preprocessor is collecting the
arguments to a macro call.  It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
@code{PREV_WHITE} flag of a token if it meets a new line when
@code{parsing_args} is set to 2.  It doesn't set it if it meets a new
line when @code{parsing_args} is 1, since then code like

@smallexample
#define foo() bar
foo
baz
@end smallexample

@noindent would be output with an erroneous space before @samp{baz}:

@smallexample
foo
 baz
@end smallexample

This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.

The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}
and @samp{\n\r} as a single new line indicator.  This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.

@c
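As an illustration only, the following C sketch shows one way to treat
the four spellings above as a single new line indicator.  The names
@code{newline_length} and @code{count_newlines} are hypothetical and are
not cpplib functions.  The sketch assumes that a second newline
character is folded into the first only when it differs from it, so that
@samp{\n\n} still counts as two line endings.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: P points at a '\r' or '\n'.  Return how many
   characters form that one newline indicator: 2 for "\r\n" or "\n\r",
   otherwise 1.  */
size_t newline_length(const char *p)
{
    assert(*p == '\n' || *p == '\r');
    /* A different newline character directly after the first one
       belongs to the same indicator.  */
    if ((p[1] == '\n' || p[1] == '\r') && p[1] != p[0])
        return 2;
    return 1;
}

/* Count the line endings in a NUL-terminated buffer, whichever of the
   four spellings each one uses.  */
unsigned count_newlines(const char *buf)
{
    unsigned n = 0;
    while (*buf != '\0') {
        if (*buf == '\n' || *buf == '\r') {
            buf += newline_length(buf);
            n++;
        } else {
            buf++;
        }
    }
    return n;
}
```
@end verbatim
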
We also decided to treat a backslash, either @samp{\} or the trigraph
@samp{??/}, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline.  It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars.  Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place
only.  The function @code{handle_newline} takes care of all newline
characters, and @code{skip_escaped_newlines} takes care of arbitrarily
long sequences of escaped newlines, deferring to @code{handle_newline}
to handle the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph @samp{??/} to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere---between the @samp{+} and @samp{=} of the @samp{+=} token,
within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminate a comment.  Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, @code{parse_number},
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph
sequence that represents the @samp{\} of an escaped newline.  If it
encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
to skip over any potential escaped newlines before checking whether the
number has been finished.

Similarly, code in the main body of @code{_cpp_lex_direct} cannot simply
check for a @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
some sort.  Such cases use the function @code{get_effective_char}, which
returns the first character after any intervening escaped newlines.

@c
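The idea of looking past escaped newlines can be sketched in C as
follows.  This is a simplified illustration, not the code of
@code{get_effective_char}: the names @code{escaped_newline_length},
@code{next_effective_char} and @code{is_plus_equals} are hypothetical,
only @samp{\n} is accepted as the newline, and the whitespace tolerated
between the backslash and the newline is ignored for brevity.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Length of the escaped newline starting at P, or 0 if P does not
   start one.  The backslash may be spelled directly or as the
   three-character trigraph for backslash.  */
size_t escaped_newline_length(const char *p)
{
    size_t n;

    assert(p != NULL);
    if (p[0] == '\\')
        n = 1;
    else if (p[0] == '?' && p[1] == '?' && p[2] == '/')
        n = 3;
    else
        return 0;
    return p[n] == '\n' ? n + 1 : 0;
}

/* Skip any run of escaped newlines at *PP, then return the character
   found there and step past it.  Returns '\0' at end of buffer
   without stepping past it.  */
int next_effective_char(const char **pp)
{
    const char *p = *pp;
    size_t n;

    while ((n = escaped_newline_length(p)) != 0)
        p += n;
    *pp = (*p == '\0') ? p : p + 1;
    return (unsigned char) *p;
}

/* AFTER_PLUS points just past a '+'.  Report whether the token
   continues as "+=", even across escaped newlines.  */
int is_plus_equals(const char *after_plus)
{
    return next_effective_char(&after_plus) == '=';
}
```
@end verbatim
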
The lexer needs to keep track of the correct column position, including
counting tabs as specified by the @option{-ftabstop=} option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.

@c
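A minimal sketch of tab-aware column counting is given below.  It
assumes 1-based columns and tab stops every @var{tabstop} columns; the
function names @code{next_column} and @code{column_of} are hypothetical,
and the sketch is not taken from cpplib.  Because every character is
counted, including those inside a comment, text after the comment is
still reported at the right column.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* COL is the 1-based column occupied by character C.  Return the
   column of the character that follows it.  A tab advances to the
   column just after the next multiple of TABSTOP.  */
unsigned next_column(unsigned col, char c, unsigned tabstop)
{
    assert(col >= 1 && tabstop >= 1);
    if (c == '\t')
        return ((col - 1) / tabstop + 1) * tabstop + 1;
    return col + 1;
}

/* Return the 1-based column of LINE[OFFSET], counting every earlier
   character on the line, whether or not it lies inside a comment.  */
unsigned column_of(const char *line, size_t offset, unsigned tabstop)
{
    unsigned col = 1;
    for (size_t i = 0; i < offset; i++)
        col = next_column(col, line[i], tabstop);
    return col;
}
```
@end verbatim
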
@anchor{Invalid identifiers}
Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic.  However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
@code{parse_identifier}.  In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state.  For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
Therefore @code{parse_identifier} makes use of state flags to determine
whether a diagnostic is appropriate.  Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a @samp{<} would be lexed as a token on
its own.  After a @code{#include} directive, though, everything from the
@samp{<} as far as the nearest @samp{>} character should be lexed as a
single header-name token.  Note that we
don't allow the terminators of header names to be escaped; the first
@samp{"} or @samp{>} terminates the header name.

Interpretation of some character sequences depends upon whether we are
lexing C or C++, and on the revision of the standard in force.  For
example, @samp{::} is a single token in C++, but in C it is two
separate @samp{:} tokens and almost certainly a syntax error.  Such
cases are handled by @code{_cpp_lex_direct} based upon command-line
flags stored in the @code{cpp_options} structure.

Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with @code{_cpp_pop_buffer}.
The storage holding the spellings of such tokens remains until the
client program calls @code{cpp_destroy}, probably at the end of the
translation unit.

@anchor{Lexing a line}
@section Lexing a line
@cindex token run

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid.  This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

Occasionally the preprocessor wants to be able to peek ahead in the
token stream.  For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
@code{#pragma} directive and not recognizing it as a registered pragma,
it wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full @code{#pragma} token stream.  The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid.  The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a @dfn{token run}, and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run.  Note that we do @emph{not} lex a line of
tokens at once; if we did that @code{parse_identifier} would not have
state flags available to warn about invalid identifiers (@pxref{Invalid
identifiers}).

In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable.  In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the @code{#pragma} handler can jump around
in the directive's tokens if necessary.

Two issues remain: what about tokens that arise from macro expansions,
and what happens when we have a long line that overflows the token run?

Since we promise clients that we preserve the validity of pointers that
we have already returned for tokens that appeared earlier in the line,
we cannot reallocate the run.  Instead, on overflow it is expanded by
chaining a new token run on to the end of the existing one.

@c
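The chaining described above can be sketched in C as follows.  This is
an illustration under stated assumptions rather than cpplib's code: the
names @code{token_run}, @code{lexer_state}, @code{new_run},
@code{init_lexer}, @code{start_line} and @code{alloc_token} are
hypothetical, and a token is reduced to a single @code{int}.  The point
of the sketch is that an overflowing run is never reallocated, so
pointers already handed out for the current line stay valid.

@verbatim
```c
#include <assert.h>
#include <stdlib.h>

typedef struct token { int type; } token;

typedef struct token_run {
    token *base, *limit;        /* first slot, one past the last slot */
    struct token_run *next;     /* run chained on overflow, or NULL */
} token_run;

typedef struct lexer_state {
    token_run *base_run;        /* first run; every line starts here */
    token_run *cur_run;         /* run currently being filled */
    token *cur_token;           /* next free slot in cur_run */
} lexer_state;

token_run *new_run(size_t count)
{
    assert(count > 0);
    token_run *run = malloc(sizeof *run);
    if (run == NULL)
        abort();
    run->base = malloc(count * sizeof *run->base);
    if (run->base == NULL)
        abort();
    run->limit = run->base + count;
    run->next = NULL;
    return run;
}

void init_lexer(lexer_state *s, size_t count)
{
    s->base_run = s->cur_run = new_run(count);
    s->cur_token = s->base_run->base;
}

/* At an unescaped new line: lex back at the beginning of the run.  */
void start_line(lexer_state *s)
{
    s->cur_run = s->base_run;
    s->cur_token = s->base_run->base;
}

/* Return storage for the next token.  On overflow, move to the next
   run in the chain, creating it if needed; existing runs never move.  */
token *alloc_token(lexer_state *s)
{
    if (s->cur_token == s->cur_run->limit) {
        token_run *run = s->cur_run;
        if (run->next == NULL)
            run->next = new_run((size_t) (run->limit - run->base));
        s->cur_run = run->next;
        s->cur_token = run->next->base;
    }
    return s->cur_token++;
}
```
@end verbatim
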
The tokens forming a macro's replacement list are collected by the
@code{#define} handler, and placed in storage that is only freed by
@code{cpp_destroy}.  So if a macro is expanded in the line of tokens,
the pointers to the tokens of its expansion that are returned will always
remain valid.  However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens.  They are the built-in
macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators
for stringification and token pasting.  I handled this by allocating
space for these tokens from the lexer's token run chain.  This means
they automatically receive the same lifetime guarantees as lexed tokens,
and we don't need to concern ourselves with freeing them.

Lexing into a line of tokens solves some of the token memory management
issues, but not all.  The opening parenthesis after a function-like
macro name might lie on a different line, and the front ends definitely
want the ability to look ahead past the end of the current line.  So
cpplib only moves back to the start of the token run at the end of a
line if the variable @code{keep_tokens} is zero.  Line-buffering is
quite natural for the preprocessor, and as a result the only time cpplib
needs to increment this variable is whilst looking for the opening
parenthesis to, and reading the arguments of, a function-like macro.  In
the near future cpplib will export an interface to increment and
decrement this variable, so that clients can share full control over the
lifetime of token pointers too.

The routine @code{_cpp_lex_token} handles moving to new token runs,
calling @code{_cpp_lex_direct} to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream.  It also
checks each token for the @code{BOL} flag, which might indicate a
directive that needs to be handled, or require a start-of-line call-back
to be made.  @code{_cpp_lex_token} also handles skipping over tokens in
failed conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive.  In other words, its callers do not need to concern
themselves with such issues.

@node Hash Nodes
@unnumbered Hash Nodes
@cindex hash table
@cindex identifiers
@cindex macros
@cindex assertions
@cindex named operators

When cpplib encounters an ``identifier'', it generates a hash code for
it and stores it in the hash table.  By ``identifier'' we mean tokens
with type @code{CPP_NAME}; this includes identifiers in the usual C
sense, as well as keywords, directive names, macro names and so on.  For
example, all of @code{pragma}, @code{int}, @code{foo} and
@code{__GNUC__} are identifiers and hashed when lexed.

Each node in the hash table contains various information about the
identifier it represents, such as its length and type.  At any one
time, each identifier falls into exactly one of three categories:

@itemize @bullet
@item Macros

These have been declared to be macros, either on the command line or
with @code{#define}.  A few, such as @code{__TIME__}, are built-ins
entered in the hash table during initialization.  The hash node for a
normal macro points to a structure with more information about the
macro, such as whether it is function-like, how many arguments it takes,
and its expansion.  Built-in macros are flagged as special, and instead
contain an enum indicating which of the various built-in macros they are.

@item Assertions

Assertions are in a separate namespace to macros.  To enforce this, cpp
actually prepends a @code{#} character to the assertion's name before
hashing it and entering it in the hash table.  An assertion's node
points to a chain of answers to that assertion.

@item Void

Everything else falls into this category---an identifier that is not
currently a macro, or a macro that has since been undefined with
@code{#undef}.

When preprocessing C++, this category also includes the named operators,
such as @code{xor}.  In expressions these behave like the operators they
represent, but in contexts where the spelling of a token matters they
are spelt differently.  This spelling distinction is relevant when they
are operands of the stringizing and pasting macro operators @code{#} and
@code{##}.  Named operator hash nodes are flagged, both to catch the
spelling distinction and to prevent them from being defined as macros.
@end itemize

Identical identifiers share the same hash node.  Since each identifier
token, after lexing, contains a pointer to its hash node, this is used
to provide rapid lookup of various information.  For example, when
parsing a @code{#define} statement, cpplib flags each argument's identifier
hash node with the index of that argument.  This makes duplicated
argument checking an O(1) operation for each argument.  Similarly, for
each identifier in the macro's expansion, lookup to see if it is an
argument, and which argument it is, is also an O(1) operation.  Further,
each directive name, such as @code{endif}, has an associated directive
enum stored in its hash node, so that directive lookup is also O(1).

@c
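The following C sketch illustrates why a shared hash node makes the
duplicate-argument check constant time per argument.  It is a toy under
stated assumptions, not cpplib's hash table: the names @code{hashnode},
@code{intern}, @code{mark_parameters} and @code{clear_parameters} are
hypothetical, the table has a fixed number of buckets, and argument
indices are taken to be 1-based with 0 meaning ``not an argument''.

@verbatim
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

typedef struct hashnode {
    struct hashnode *chain;     /* next node in the same bucket */
    char *name;                 /* spelling of the identifier */
    int arg_index;              /* 1-based argument index, else 0 */
} hashnode;

static hashnode *buckets[NBUCKETS];

/* Return the unique node for NAME, creating it on first use, so that
   equal spellings always yield the same pointer.  */
hashnode *intern(const char *name)
{
    assert(name != NULL);
    unsigned h = 0;
    for (const char *p = name; *p != '\0'; p++)
        h = h * 31 + (unsigned char) *p;
    hashnode **slot = &buckets[h % NBUCKETS];
    for (hashnode *n = *slot; n != NULL; n = n->chain)
        if (strcmp(n->name, name) == 0)
            return n;
    hashnode *n = malloc(sizeof *n);
    size_t len = strlen(name) + 1;
    if (n == NULL || (n->name = malloc(len)) == NULL)
        abort();
    memcpy(n->name, name, len);
    n->arg_index = 0;
    n->chain = *slot;
    *slot = n;
    return n;
}

/* Flag each node with its argument index.  Return 0 on success, or
   the 1-based position of the first duplicated argument.  Each check
   is a single field read on the shared node.  */
int mark_parameters(hashnode **params, int count)
{
    for (int i = 0; i < count; i++) {
        if (params[i]->arg_index != 0)
            return i + 1;
        params[i]->arg_index = i + 1;
    }
    return 0;
}

/* Remove the flags again once the definition has been processed.  */
void clear_parameters(hashnode **params, int count)
{
    for (int i = 0; i < count; i++)
        params[i]->arg_index = 0;
}
```
@end verbatim
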
@node Macro Expansion
@unnumbered Macro Expansion Algorithm
@cindex macro expansion

Macro expansion is a tricky operation, fraught with nasty corner cases
and situations that render what you thought was a nifty way to
optimize the preprocessor's expansion algorithm wrong in quite subtle
ways.

I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this
section, let alone the code!  If you don't have a clear mental
picture of how things like nested macro expansion, stringification and
token pasting are supposed to work, damage to your sanity can quickly
result.

@section Internal representation of macros
@cindex macro representation (internal)

The preprocessor stores macro expansions in tokenized form.  This
saves repeated lexing passes during expansion, at the cost of a small
increase in memory consumption on average.  The tokens are stored
contiguously in memory, so a pointer to the first one and a token
count are all you need to get the replacement list of a macro.

If the macro is a function-like macro the preprocessor also stores its
parameters, in the form of an ordered list of pointers to the hash
table entry of each parameter's identifier.  Further, in the macro's
stored expansion each occurrence of a parameter is replaced with a
special token of type @code{CPP_MACRO_ARG}.  Each such token holds the
index of the parameter it represents in the parameter list, which
allows rapid replacement of parameters with their arguments during
expansion.  Despite this optimization it is still necessary to store
the original parameters to the macro, both for dumping with, e.g.,
@option{-dD}, and to warn about non-trivial macro redefinitions when
the parameter names have changed.

@c
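A much reduced C sketch of this representation follows.  It is not
cpplib's data structure: the names @code{tok}, @code{macro},
@code{TOK_MACRO_ARG} and @code{substitute} are hypothetical, parameter
names are plain strings rather than hash table entries, indices are
assumed to be 1-based, and every argument is a single token.  It shows
only how an index stored in the replacement list lets a parameter be
replaced without comparing any names.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

enum tok_type { TOK_OTHER, TOK_MACRO_ARG };

typedef struct tok {
    enum tok_type type;
    const char *spelling;   /* meaningful when type is TOK_OTHER */
    unsigned arg_no;        /* 1-based index when TOK_MACRO_ARG */
} tok;

typedef struct macro {
    const char **params;    /* parameter names, kept for dumping */
    unsigned paramc;        /* number of parameters */
    const tok *exp;         /* contiguous replacement list */
    unsigned count;         /* number of tokens in exp */
} macro;

/* Copy the replacement list of M into OUT, putting ARGS[i] in place
   of each token that stands for parameter i + 1.  Returns the number
   of tokens written.  */
unsigned substitute(const macro *m, const tok *args,
                    tok *out, unsigned out_size)
{
    assert(m->count <= out_size);
    for (unsigned i = 0; i < m->count; i++) {
        const tok *t = &m->exp[i];
        if (t->type == TOK_MACRO_ARG) {
            assert(t->arg_no >= 1 && t->arg_no <= m->paramc);
            out[i] = args[t->arg_no - 1];   /* direct index, no lookup */
        } else {
            out[i] = *t;
        }
    }
    return m->count;
}
```
@end verbatim
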
@section Macro expansion overview
The preprocessor maintains a @dfn{context stack}, implemented as a
linked list of @code{cpp_context} structures, which together represent
the macro expansion state at any one time.  The @code{struct
cpp_reader} member variable @code{context} points to the current top
of this stack.  The top normally holds the unexpanded replacement list
of the innermost macro under expansion, except when cpplib is about to
pre-expand an argument, in which case it holds that argument's
unexpanded tokens.

When there are no macros under expansion, cpplib is in @dfn{base
context}.  All contexts other than the base context contain a
contiguous list of tokens delimited by a starting and ending token.
When not in base context, cpplib obtains the next token from the list
of the top context.  If there are no tokens left in the list, it pops
that context off the stack, and subsequent ones if necessary, until an
unexhausted context is found or it returns to base context.  In base
context, cpplib reads tokens directly from the lexer.

@c
52490075SobrienIf it encounters an identifier that is both a macro and enabled for
52590075Sobrienexpansion, cpplib prepares to push a new context for that macro on the
52690075Sobrienstack by calling the routine @code{enter_macro_context}.  When this
52790075Sobrienroutine returns, the new context will contain the unexpanded tokens of
52890075Sobrienthe replacement list of that macro.  In the case of function-like
52990075Sobrienmacros, @code{enter_macro_context} also replaces any parameters in the
53090075Sobrienreplacement list, stored as @code{CPP_MACRO_ARG} tokens, with the
53190075Sobrienappropriate macro argument.  If the standard requires that the
53290075Sobrienparameter be replaced with its expanded argument, the argument will
53390075Sobrienhave been fully macro expanded first.
53490075Sobrien
53590075Sobrien@code{enter_macro_context} also handles special macros like
53690075Sobrien@code{__LINE__}.  Although these macros expand to a single token which
53790075Sobriencannot contain any further macros, for reasons of token spacing
53890075Sobrien(@pxref{Token Spacing}) and simplicity of implementation, cpplib
53990075Sobrienhandles these special macros by pushing a context containing just that
54090075Sobrienone token.
54190075Sobrien
54290075SobrienThe final thing that @code{enter_macro_context} does before returning
54390075Sobrienis to mark the macro disabled for expansion (except for special macros
54490075Sobrienlike @code{__TIME__}).  The macro is re-enabled when its context is
54590075Sobrienlater popped from the context stack, as described above.  This strict
54690075Sobrienordering ensures that a macro is disabled whilst its expansion is
54790075Sobrienbeing scanned, but that it is @emph{not} disabled whilst any arguments
54890075Sobriento it are being expanded.
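
The interplay between the context stack and the disabling of macros
described above can be sketched as follows.  This is an illustrative
model only: @code{struct context}, @code{struct reader},
@code{push_context}, @code{next_token} and the array-backed
@code{lex_direct} are hypothetical stand-ins, not the cpplib
definitions.

```c
#include <stddef.h>

struct token { const char *spelling; };

struct macro { int disabled; };

/* Hypothetical context: a contiguous token list, the macro it belongs to
   (NULL for an argument), and a link to the context below it.  The base
   context is the one whose prev is NULL.  */
struct context {
    struct context *prev;
    struct macro *macro;
    const struct token *cur;   /* next token to hand out */
    const struct token *end;   /* one past the last token */
};

/* Stand-in for the lexer: hands out tokens from an array.  */
struct lexer { const struct token *cur, *end; };

static const struct token eof_token = { "<eof>" };

const struct token *lex_direct(struct lexer *l)
{
    return l->cur < l->end ? l->cur++ : &eof_token;
}

struct reader {
    struct context *context;   /* current top of the stack */
    struct lexer lexer;
};

/* Push a context holding COUNT tokens starting at FIRST; disable the
   macro, if any, as the last step.  */
void push_context(struct reader *r, struct context *c, struct macro *m,
                  const struct token *first, size_t count)
{
    c->prev = r->context;
    c->macro = m;
    c->cur = first;
    c->end = first + count;
    r->context = c;
    if (m)
        m->disabled = 1;
}

/* Return the next token.  An exhausted context is popped only when the
   token after its last one is requested, which is also when its macro is
   re-enabled.  */
const struct token *next_token(struct reader *r)
{
    for (;;) {
        struct context *c = r->context;
        if (c->prev == NULL)           /* base context: ask the lexer */
            return lex_direct(&r->lexer);
        if (c->cur < c->end)
            return c->cur++;
        if (c->macro)
            c->macro->disabled = 0;    /* re-enable on pop */
        r->context = c->prev;
    }
}
```

In this model the macro stays disabled while its last replacement token
is being examined, because the pop is deferred until the following
request.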
54990075Sobrien
55090075Sobrien@section Scanning the replacement list for macros to expand
55190075SobrienThe C standard states that, after any parameters have been replaced
55290075Sobrienwith their possibly-expanded arguments, the replacement list is
55390075Sobrienscanned for nested macros.  Further, any identifiers in the
55490075Sobrienreplacement list that are not expanded during this scan are never
55590075Sobrienagain eligible for expansion in the future, if the reason they were
55690075Sobriennot expanded is that the macro in question was disabled.
55790075Sobrien
55890075SobrienClearly this latter condition can only apply to tokens resulting from
55990075Sobrienargument pre-expansion.  Other tokens never have an opportunity to be
56090075Sobrienre-tested for expansion.  It is possible for identifiers that are
56190075Sobrienfunction-like macros to not expand initially but to expand during a
56290075Sobrienlater scan.  This occurs when the identifier is the last token of an
56390075Sobrienargument (and therefore originally followed by a comma or a closing
56490075Sobrienparenthesis in its macro's argument list), and when it replaces its
56590075Sobrienparameter in the macro's replacement list, the subsequent token
56690075Sobrienhappens to be an opening parenthesis (itself possibly the first token
56790075Sobrienof an argument).
56890075Sobrien
56990075SobrienIt is important to note that when cpplib reads the last token of a
57090075Sobriengiven context, that context still remains on the stack.  Only when
57190075Sobrienlooking for the @emph{next} token do we pop it off the stack and drop
57290075Sobriento a lower context.  This makes backing up by one token easy, but more
57390075Sobrienimportantly ensures that the macro corresponding to the current
57490075Sobriencontext is still disabled when we are considering the last token of
57590075Sobrienits replacement list for expansion (or indeed expanding it).  As an
57690075Sobrienexample, which illustrates many of the points above, consider
57790075Sobrien
57890075Sobrien@smallexample
57990075Sobrien#define foo(x) bar x
58090075Sobrienfoo(foo) (2)
58190075Sobrien@end smallexample
58290075Sobrien
58390075Sobrien@noindent which fully expands to @samp{bar foo (2)}.  During pre-expansion
58490075Sobrienof the argument, @samp{foo} does not expand even though the macro is
58590075Sobrienenabled, since it has no following parenthesis [pre-expansion of an
58690075Sobrienargument only uses tokens from that argument; it cannot take tokens
58790075Sobrienfrom whatever follows the macro invocation].  This still leaves the
58890075Sobrienargument token @samp{foo} eligible for future expansion.  Then, when
58990075Sobrienre-scanning after argument replacement, the token @samp{foo} is
59090075Sobrienrejected for expansion, and marked ineligible for future expansion,
59190075Sobriensince the macro is now disabled.  It is disabled because the
59290075Sobrienreplacement list @samp{bar foo} of the macro is still on the context
59390075Sobrienstack.
59490075Sobrien
59590075SobrienIf instead the algorithm looked for an opening parenthesis first and
59690075Sobrienthen tested whether the macro were disabled it would be subtly wrong.
59790075SobrienIn the example above, the replacement list of @samp{foo} would be
59890075Sobrienpopped in the process of finding the parenthesis, re-enabling
59990075Sobrien@samp{foo} and expanding it a second time.
60090075Sobrien
60190075Sobrien@section Looking for a function-like macro's opening parenthesis
Function-like macros only expand when immediately followed by an
opening parenthesis.  To check for one, cpplib needs to temporarily
disable macros and read the next token.  Unfortunately, because of spacing issues
60590075Sobrien(@pxref{Token Spacing}), there can be fake padding tokens in-between,
60690075Sobrienand if the next real token is not a parenthesis cpplib needs to be
60790075Sobrienable to back up that one token as well as retain the information in
60890075Sobrienany intervening padding tokens.
60990075Sobrien
61090075SobrienBacking up more than one token when macros are involved is not
61190075Sobrienpermitted by cpplib, because in general it might involve issues like
61290075Sobrienrestoring popped contexts onto the context stack, which are too hard.
61390075SobrienInstead, searching for the parenthesis is handled by a special
61490075Sobrienfunction, @code{funlike_invocation_p}, which remembers padding
61590075Sobrieninformation as it reads tokens.  If the next real token is not an
61690075Sobrienopening parenthesis, it backs up that one token, and then pushes an
61790075Sobrienextra context just containing the padding information if necessary.
61890075Sobrien
61990075Sobrien@section Marking tokens ineligible for future expansion
62090075SobrienAs discussed above, cpplib needs a way of marking tokens as
62190075Sobrienunexpandable.  Since the tokens cpplib handles are read-only once they
62290075Sobrienhave been lexed, it instead makes a copy of the token and adds the
62390075Sobrienflag @code{NO_EXPAND} to the copy.
62490075Sobrien
62590075SobrienFor efficiency and to simplify memory management by avoiding having to
62690075Sobrienremember to free these tokens, they are allocated as temporary tokens
62790075Sobrienfrom the lexer's current token run (@pxref{Lexing a line}) using the
62890075Sobrienfunction @code{_cpp_temp_token}.  The tokens are then re-used once the
62990075Sobriencurrent line of tokens has been read in.
63090075Sobrien
This might sound unsafe.  However, token runs are not re-used at the
end of a line if it happens to be in the middle of a macro argument
list, and cpplib only wants to back up more than one lexer token in
63490075Sobriensituations where no macro expansion is involved, so the optimization
63590075Sobrienis safe.
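
The copy-and-flag step can be pictured with the following sketch.  The
run structure, the flag value and the helper names @code{temp_token}
and @code{mark_unexpandable} are assumptions made for illustration;
only the idea of copying a read-only token into temporary storage and
setting @code{NO_EXPAND} on the copy comes from the text above.

```c
#include <stddef.h>

#define NO_EXPAND 2u   /* illustrative flag value, not cpplib's */

struct token {
    unsigned flags;
    const char *name;
};

/* Hypothetical token run: a fixed block of tokens handed out in order
   and recycled wholesale once the current line has been read.  */
struct token_run {
    struct token *base, *cur, *limit;
};

/* Hand out the next temporary token, or NULL if the run is full.  */
struct token *temp_token(struct token_run *run)
{
    return run->cur < run->limit ? run->cur++ : NULL;
}

/* Leave ORIG untouched; return a temporary copy marked unexpandable.  */
const struct token *mark_unexpandable(struct token_run *run,
                                      const struct token *orig)
{
    struct token *copy = temp_token(run);
    if (copy == NULL)
        return NULL;
    *copy = *orig;
    copy->flags |= NO_EXPAND;
    return copy;
}

/* Recycle the run, invalidating every temporary token handed out.  */
void reset_run(struct token_run *run)
{
    run->cur = run->base;
}
```

Nothing is freed individually here: resetting the run reclaims all of
the temporary copies at once.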
63690075Sobrien
63790075Sobrien@node Token Spacing
63890075Sobrien@unnumbered Token Spacing
63990075Sobrien@cindex paste avoidance
64090075Sobrien@cindex spacing
64190075Sobrien@cindex token spacing
64290075Sobrien
643132718SkanFirst, consider an issue that only concerns the stand-alone
644132718Skanpreprocessor: there needs to be a guarantee that re-reading its preprocessed
64590075Sobrienoutput results in an identical token stream.  Without taking special
64690075Sobrienmeasures, this might not be the case because of macro substitution.
64790075SobrienFor example:
64890075Sobrien
64990075Sobrien@smallexample
65090075Sobrien#define PLUS +
65190075Sobrien#define EMPTY
65290075Sobrien#define f(x) =x=
65390075Sobrien+PLUS -EMPTY- PLUS+ f(=)
65490075Sobrien        @expansion{} + + - - + + = = =
65590075Sobrien@emph{not}
65690075Sobrien        @expansion{} ++ -- ++ ===
65790075Sobrien@end smallexample
65890075Sobrien
65990075SobrienOne solution would be to simply insert a space between all adjacent
66090075Sobrientokens.  However, we would like to keep space insertion to a minimum,
66190075Sobrienboth for aesthetic reasons and because it causes problems for people who
66290075Sobrienstill try to abuse the preprocessor for things like Fortran source and
66390075SobrienMakefiles.
66490075Sobrien
66590075SobrienFor now, just notice that when tokens are added (or removed, as shown by
66690075Sobrienthe @code{EMPTY} example) from the original lexed token stream, we need
66790075Sobriento check for accidental token pasting.  We call this @dfn{paste
66890075Sobrienavoidance}.  Token addition and removal can only occur because of macro
66990075Sobrienexpansion, but accidental pasting can occur in many places: both before
67090075Sobrienand after each macro replacement, each argument replacement, and
67190075Sobrienadditionally each token created by the @samp{#} and @samp{##} operators.
67290075Sobrien
673132718SkanLook at how the preprocessor gets whitespace output correct
67490075Sobriennormally.  The @code{cpp_token} structure contains a flags byte, and one
67590075Sobrienof those flags is @code{PREV_WHITE}.  This is flagged by the lexer, and
67690075Sobrienindicates that the token was preceded by whitespace of some form other
67790075Sobrienthan a new line.  The stand-alone preprocessor can use this flag to
67890075Sobriendecide whether to insert a space between tokens in the output.
67990075Sobrien
68090075SobrienNow consider the result of the following macro expansion:
68190075Sobrien
68290075Sobrien@smallexample
68390075Sobrien#define add(x, y, z) x + y +z;
68490075Sobriensum = add (1,2, 3);
68590075Sobrien        @expansion{} sum = 1 + 2 +3;
68690075Sobrien@end smallexample
68790075Sobrien
68890075SobrienThe interesting thing here is that the tokens @samp{1} and @samp{2} are
68990075Sobrienoutput with a preceding space, and @samp{3} is output without a
69090075Sobrienpreceding space, but when lexed none of these tokens had that property.
69190075SobrienCareful consideration reveals that @samp{1} gets its preceding
69290075Sobrienwhitespace from the space preceding @samp{add} in the macro invocation,
@emph{not} from the replacement list.  @samp{2} gets its whitespace from the
69490075Sobrienspace preceding the parameter @samp{y} in the macro replacement list,
69590075Sobrienand @samp{3} has no preceding space because parameter @samp{z} has none
69690075Sobrienin the replacement list.
69790075Sobrien
69890075SobrienOnce lexed, tokens are effectively fixed and cannot be altered, since
69990075Sobrienpointers to them might be held in many places, in particular by
70090075Sobrienin-progress macro expansions.  So instead of modifying the two tokens
70190075Sobrienabove, the preprocessor inserts a special token, which I call a
70290075Sobrien@dfn{padding token}, into the token stream to indicate that spacing of
70390075Sobrienthe subsequent token is special.  The preprocessor inserts padding
70490075Sobrientokens in front of every macro expansion and expanded macro argument.
70590075SobrienThese point to a @dfn{source token} from which the subsequent real token
70690075Sobrienshould inherit its spacing.  In the above example, the source tokens are
70790075Sobrien@samp{add} in the macro invocation, and @samp{y} and @samp{z} in the
70890075Sobrienmacro replacement list, respectively.
70990075Sobrien
71090075SobrienIt is quite easy to get multiple padding tokens in a row, for example if
71190075Sobriena macro's first replacement token expands straight into another macro.
71290075Sobrien
71390075Sobrien@smallexample
71490075Sobrien#define foo bar
71590075Sobrien#define bar baz
71690075Sobrien[foo]
71790075Sobrien        @expansion{} [baz]
71890075Sobrien@end smallexample
71990075Sobrien
Here, two padding tokens are generated, whose sources are the @samp{foo}
token between the brackets and the @samp{bar} token from @samp{foo}'s
replacement list, respectively.  Clearly the first padding token is the
one to use, so the output code should contain a rule that the first
padding token in a sequence is the one that matters.
72590075Sobrien
But what happens when cpplib leaves a macro expansion?  Adjusting the
above example slightly:
72890075Sobrien
72990075Sobrien@smallexample
73090075Sobrien#define foo bar
73190075Sobrien#define bar EMPTY baz
73290075Sobrien#define EMPTY
73390075Sobrien[foo] EMPTY;
73490075Sobrien        @expansion{} [ baz] ;
73590075Sobrien@end smallexample
73690075Sobrien
73790075SobrienAs shown, now there should be a space before @samp{baz} and the
73890075Sobriensemicolon in the output.
73990075Sobrien
74090075SobrienThe rules we decided above fail for @samp{baz}: we generate three
74190075Sobrienpadding tokens, one per macro invocation, before the token @samp{baz}.
74290075SobrienWe would then have it take its spacing from the first of these, which
74390075Sobriencarries source token @samp{foo} with no leading space.
74490075Sobrien
74590075SobrienIt is vital that cpplib get spacing correct in these examples since any
74690075Sobrienof these macro expansions could be stringified, where spacing matters.
74790075Sobrien
74890075SobrienSo, this demonstrates that not just entering macro and argument
74990075Sobrienexpansions, but leaving them requires special handling too.  I made
75090075Sobriencpplib insert a padding token with a @code{NULL} source token when
75190075Sobrienleaving macro expansions, as well as after each replaced argument in a
75290075Sobrienmacro's replacement list.  It also inserts appropriate padding tokens on
75390075Sobrieneither side of tokens created by the @samp{#} and @samp{##} operators.
I expanded the rule so that, if we see a padding token with a
@code{NULL} source token, @emph{and} the source token remembered so far
has no leading space, then we behave as if we have seen no padding
tokens at all.  A quick check shows this rule will then get the above
example correct as well.
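
To make the padding rules concrete, here is a small sketch of an output
loop that applies them.  It is one possible reading of the rules, with
hypothetical names (@code{struct token}, @code{print_tokens}) and an
illustrative value for @code{PREV_WHITE}; paste avoidance is
deliberately left out, and the ``no leading space'' test in the
expanded rule is assumed to apply to the source token remembered so
far.

```c
#include <stddef.h>
#include <string.h>

#define PREV_WHITE 1u   /* illustrative flag value */

enum tok_type { TOK_REAL, TOK_PADDING };

struct token {
    enum tok_type type;
    unsigned flags;               /* PREV_WHITE or 0 */
    const char *spelling;         /* real tokens only */
    const struct token *source;   /* padding tokens only; may be NULL */
};

/* Spell TOKS[0..N) into OUT (capacity CAP), deciding for each real token
   whether a space precedes it.  Returns OUT, or NULL if it is too small. */
char *print_tokens(const struct token *const *toks, size_t n,
                   char *out, size_t cap)
{
    const struct token *source = NULL;  /* remembered source token */
    size_t len = 0;

    if (cap == 0)
        return NULL;
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        const struct token *t = toks[i];

        if (t->type == TOK_PADDING) {
            /* The first padding token in a sequence wins, unless a
               NULL-source padding arrives while the remembered source
               has no leading space: then forget the sequence so far.  */
            if (source == NULL
                || (!(source->flags & PREV_WHITE) && t->source == NULL))
                source = t->source;
            continue;
        }

        /* A real token takes its spacing from the remembered source,
           or from itself if there is none.  */
        const struct token *from = source ? source : t;
        int space = (from->flags & PREV_WHITE) != 0;
        size_t slen = strlen(t->spelling);

        if (len + slen + (size_t) space + 1 > cap)
            return NULL;
        if (space)
            out[len++] = ' ';
        memcpy(out + len, t->spelling, slen + 1);
        len += slen;
        source = NULL;
    }
    return out;
}
```

Feeding this loop the token stream of the @samp{[foo] EMPTY;} example,
with one padding token per macro entered and one @code{NULL}-source
padding token per macro left, gives a space before @samp{baz} and
before the semicolon.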
75990075Sobrien
76090075SobrienNow a relationship with paste avoidance is apparent: we have to be
76190075Sobriencareful about paste avoidance in exactly the same locations we have
76290075Sobrienpadding tokens in order to get white space correct.  This makes
76390075Sobrienimplementation of paste avoidance easy: wherever the stand-alone
76490075Sobrienpreprocessor is fixing up spacing because of padding tokens, and it
76590075Sobrienturns out that no space is needed, it has to take the extra step to
76690075Sobriencheck that a space is not needed after all to avoid an accidental paste.
76790075SobrienThe function @code{cpp_avoid_paste} advises whether a space is required
76890075Sobrienbetween two consecutive tokens.  To avoid excessive spacing, it tries
76990075Sobrienhard to only require a space if one is likely to be necessary, but for
77090075Sobrienreasons of efficiency it is slightly conservative and might recommend a
77190075Sobrienspace where one is not strictly needed.
77290075Sobrien
77390075Sobrien@node Line Numbering
77490075Sobrien@unnumbered Line numbering
77590075Sobrien@cindex line numbers
77690075Sobrien
77790075Sobrien@section Just which line number anyway?
77890075Sobrien
77990075SobrienThere are three reasonable requirements a cpplib client might have for
78090075Sobrienthe line number of a token passed to it:
78190075Sobrien
78290075Sobrien@itemize @bullet
78390075Sobrien@item
78490075SobrienThe source line it was lexed on.
78590075Sobrien@item
78690075SobrienThe line it is output on.  This can be different to the line it was
78790075Sobrienlexed on if, for example, there are intervening escaped newlines or
78890075SobrienC-style comments.  For example:
78990075Sobrien
79090075Sobrien@smallexample
791169689Skanfoo /* @r{A long
792169689Skancomment} */ bar \
79390075Sobrienbaz
79490075Sobrien@result{}
79590075Sobrienfoo bar baz
79690075Sobrien@end smallexample
79790075Sobrien
79890075Sobrien@item
79990075SobrienIf the token results from a macro expansion, the line of the macro name,
80090075Sobrienor possibly the line of the closing parenthesis in the case of
80190075Sobrienfunction-like macro expansion.
80290075Sobrien@end itemize
80390075Sobrien
80490075SobrienThe @code{cpp_token} structure contains @code{line} and @code{col}
80590075Sobrienmembers.  The lexer fills these in with the line and column of the first
80690075Sobriencharacter of the token.  Consequently, but maybe unexpectedly, a token
80790075Sobrienfrom the replacement list of a macro expansion carries the location of
80890075Sobrienthe token within the @code{#define} directive, because cpplib expands a
80990075Sobrienmacro by returning pointers to the tokens in its replacement list.  The
81090075Sobriencurrent implementation of cpplib assigns tokens created from built-in
81190075Sobrienmacros and the @samp{#} and @samp{##} operators the location of the most
recently lexed token.  This is because they are allocated from the
81390075Sobrienlexer's token runs, and because of the way the diagnostic routines infer
81490075Sobrienthe appropriate location to report.
81590075Sobrien
81690075SobrienThe diagnostic routines in cpplib display the location of the most
81790075Sobrienrecently @emph{lexed} token, unless they are passed a specific line and
81890075Sobriencolumn to report.  For diagnostics regarding tokens that arise from
81990075Sobrienmacro expansions, it might also be helpful for the user to see the
82090075Sobrienoriginal location in the macro definition that the token came from.
82190075SobrienSince that is exactly the information each token carries, such an
82290075Sobrienenhancement could be made relatively easily in future.
82390075Sobrien
82490075SobrienThe stand-alone preprocessor faces a similar problem when determining
82590075Sobrienthe correct line to output the token on: the position attached to a
82690075Sobrientoken is fairly useless if the token came from a macro expansion.  All
82790075Sobrientokens on a logical line should be output on its first physical line, so
82890075Sobrienthe token's reported location is also wrong if it is part of a physical
82990075Sobrienline other than the first.
83090075Sobrien
83190075SobrienTo solve these issues, cpplib provides a callback that is generated
83290075Sobrienwhenever it lexes a preprocessing token that starts a new logical line
83390075Sobrienother than a directive.  It passes this token (which may be a
83490075Sobrien@code{CPP_EOF} token indicating the end of the translation unit) to the
83590075Sobriencallback routine, which can then use the line and column of this token
83690075Sobriento produce correct output.
83790075Sobrien
83890075Sobrien@section Representation of line numbers
83990075Sobrien
84090075SobrienAs mentioned above, cpplib stores with each token the line number that
84190075Sobrienit was lexed on.  In fact, this number is not the number of the line in
84290075Sobrienthe source file, but instead bears more resemblance to the number of the
84390075Sobrienline in the translation unit.
84490075Sobrien
The preprocessor maintains a monotonically increasing line count, which is
84690075Sobrienincremented at every new line character (and also at the end of any
84790075Sobrienbuffer that does not end in a new line).  Since a line number of zero is
84890075Sobrienuseful to indicate certain special states and conditions, this variable
84990075Sobrienstarts counting from one.
85090075Sobrien
85190075SobrienThis variable therefore uniquely enumerates each line in the translation
unit.  With some simple infrastructure, it is straightforward to map
from this to the original source file and line number pair, saving space
whenever line number information needs to be saved.  The code that
implements this mapping lies in the files @file{line-map.c} and
85690075Sobrien@file{line-map.h}.
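
The kind of ``simple infrastructure'' meant here can be sketched as a
sorted table searched by binary search.  The structure and function
names below (@code{struct line_map_entry}, @code{lookup_line},
@code{source_line}) are hypothetical and are not the interface of
@file{line-map.h}; the sketch only assumes that one entry is recorded
each time the file being read changes.

```c
#include <stddef.h>

/* Hypothetical map entry: from translation-unit line FROM_LINE onwards,
   cpplib is reading FILE, and FROM_LINE corresponds to TO_LINE in it.  */
struct line_map_entry {
    unsigned from_line;
    const char *file;
    unsigned to_line;
};

/* MAPS is sorted by ascending from_line.  Return the last entry whose
   from_line is <= LINE, or NULL if there is none.  */
const struct line_map_entry *
lookup_line(const struct line_map_entry *maps, size_t n, unsigned line)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (maps[mid].from_line <= line)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &maps[lo - 1];
}

/* Convert a translation-unit line covered by E to a line within E->file. */
unsigned source_line(const struct line_map_entry *e, unsigned line)
{
    return e->to_line + (line - e->from_line);
}
```

With such a table, each token needs to store only a single integer, and
the file name and source line are recovered on demand.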
85790075Sobrien
85890075SobrienCommand-line macros and assertions are implemented by pushing a buffer
85990075Sobriencontaining the right hand side of an equivalent @code{#define} or
86090075Sobrien@code{#assert} directive.  Some built-in macros are handled similarly.
86190075SobrienSince these are all processed before the first line of the main input
86290075Sobrienfile, it will typically have an assigned line closer to twenty than to
86390075Sobrienone.
86490075Sobrien
86590075Sobrien@node Guard Macros
86690075Sobrien@unnumbered The Multiple-Include Optimization
86790075Sobrien@cindex guard macros
86890075Sobrien@cindex controlling macros
86990075Sobrien@cindex multiple-include optimization
87090075Sobrien
87190075SobrienHeader files are often of the form
87290075Sobrien
87390075Sobrien@smallexample
87490075Sobrien#ifndef FOO
87590075Sobrien#define FOO
87690075Sobrien@dots{}
87790075Sobrien#endif
87890075Sobrien@end smallexample
87990075Sobrien
88090075Sobrien@noindent
88190075Sobriento prevent the compiler from processing them more than once.  The
88290075Sobrienpreprocessor notices such header files, so that if the header file
88390075Sobrienappears in a subsequent @code{#include} directive and @code{FOO} is
defined, then the directive is ignored and the preprocessor does not
preprocess or even re-open the file a second time.  This is referred to
as the @dfn{multiple
88690075Sobrieninclude optimization}.
88790075Sobrien
88890075SobrienUnder what circumstances is such an optimization valid?  If the file
88990075Sobrienwere included a second time, it can only be optimized away if that
89090075Sobrieninclusion would result in no tokens to return, and no relevant
89190075Sobriendirectives to process.  Therefore the current implementation imposes
89290075Sobrienrequirements and makes some allowances as follows:
89390075Sobrien
89490075Sobrien@enumerate
89590075Sobrien@item
89690075SobrienThere must be no tokens outside the controlling @code{#if}-@code{#endif}
89790075Sobrienpair, but whitespace and comments are permitted.
89890075Sobrien
89990075Sobrien@item
90090075SobrienThere must be no directives outside the controlling directive pair, but
90190075Sobrienthe @dfn{null directive} (a line containing nothing other than a single
90290075Sobrien@samp{#} and possibly whitespace) is permitted.
90390075Sobrien
90490075Sobrien@item
90590075SobrienThe opening directive must be of the form
90690075Sobrien
90790075Sobrien@smallexample
90890075Sobrien#ifndef FOO
90990075Sobrien@end smallexample
91090075Sobrien
91190075Sobrienor
91290075Sobrien
91390075Sobrien@smallexample
91490075Sobrien#if !defined FOO     [equivalently, #if !defined(FOO)]
91590075Sobrien@end smallexample
91690075Sobrien
91790075Sobrien@item
91890075SobrienIn the second form above, the tokens forming the @code{#if} expression
91990075Sobrienmust have come directly from the source file---no macro expansion must
92090075Sobrienhave been involved.  This is because macro definitions can change, and
92190075Sobrientracking whether or not a relevant change has been made is not worth the
92290075Sobrienimplementation cost.
92390075Sobrien
92490075Sobrien@item
92590075SobrienThere can be no @code{#else} or @code{#elif} directives at the outer
92690075Sobrienconditional block level, because they would probably contain something
92790075Sobrienof interest to a subsequent pass.
92890075Sobrien@end enumerate
92990075Sobrien
93090075SobrienFirst, when pushing a new file on the buffer stack,
@code{stack_include_file} sets the controlling macro @code{mi_cmacro} to
93290075Sobrien@code{NULL}, and sets @code{mi_valid} to @code{true}.  This indicates
93390075Sobrienthat the preprocessor has not yet encountered anything that would
93490075Sobrieninvalidate the multiple-include optimization.  As described in the next
93590075Sobrienfew paragraphs, these two variables having these values effectively
93690075Sobrienindicates top-of-file.
93790075Sobrien
93890075SobrienWhen about to return a token that is not part of a directive,
93990075Sobrien@code{_cpp_lex_token} sets @code{mi_valid} to @code{false}.  This
94090075Sobrienenforces the constraint that tokens outside the controlling conditional
94190075Sobrienblock invalidate the optimization.
94290075Sobrien
94390075SobrienThe @code{do_if}, when appropriate, and @code{do_ifndef} directive
94490075Sobrienhandlers pass the controlling macro to the function
94590075Sobrien@code{push_conditional}.  cpplib maintains a stack of nested conditional
94690075Sobrienblocks, and after processing every opening conditional this function
94790075Sobrienpushes an @code{if_stack} structure onto the stack.  In this structure
94890075Sobrienit records the controlling macro for the block, provided there is one
94990075Sobrienand we're at top-of-file (as described above).  If an @code{#elif} or
95090075Sobrien@code{#else} directive is encountered, the controlling macro for that
95190075Sobrienblock is cleared to @code{NULL}.  Otherwise, it survives until the
95290075Sobrien@code{#endif} closing the block, upon which @code{do_endif} sets
95390075Sobrien@code{mi_valid} to true and stores the controlling macro in
95490075Sobrien@code{mi_cmacro}.
95590075Sobrien
95690075Sobrien@code{_cpp_handle_directive} clears @code{mi_valid} when processing any
95790075Sobriendirective other than an opening conditional and the null directive.
95890075SobrienWith this, and requiring top-of-file to record a controlling macro, and
95990075Sobrienno @code{#else} or @code{#elif} for it to survive and be copied to
96090075Sobrien@code{mi_cmacro} by @code{do_endif}, we have enforced the absence of
96190075Sobriendirectives outside the main conditional block for the optimization to be
96290075Sobrienon.
96390075Sobrien
96490075SobrienNote that whilst we are inside the conditional block, @code{mi_valid} is
965169689Skanlikely to be reset to @code{false}, but this does not matter since
96690075Sobrienthe closing @code{#endif} restores it to @code{true} if appropriate.
96790075Sobrien
96890075SobrienFinally, since @code{_cpp_lex_direct} pops the file off the buffer stack
96990075Sobrienat @code{EOF} without returning a token, if the @code{#endif} directive
97090075Sobrienwas not followed by any tokens, @code{mi_valid} is @code{true} and
97190075Sobrien@code{_cpp_pop_file_buffer} remembers the controlling macro associated
97290075Sobrienwith the file.  Subsequent calls to @code{stack_include_file} result in
97390075Sobrienno buffer being pushed if the controlling macro is defined, effecting
97490075Sobrienthe optimization.
97590075Sobrien
97690075SobrienA quick word on how we handle the
97790075Sobrien
97890075Sobrien@smallexample
97990075Sobrien#if !defined FOO
98090075Sobrien@end smallexample
98190075Sobrien
98290075Sobrien@noindent
98390075Sobriencase.  @code{_cpp_parse_expr} and @code{parse_defined} take steps to see
98490075Sobrienwhether the three stages @samp{!}, @samp{defined-expression} and
98590075Sobrien@samp{end-of-directive} occur in order in a @code{#if} expression.  If
98690075Sobrienso, they return the guard macro to @code{do_if} in the variable
98790075Sobrien@code{mi_ind_cmacro}, and otherwise set it to @code{NULL}.
98890075Sobrien@code{enter_macro_context} sets @code{mi_valid} to false, so if a macro
98990075Sobrienwas expanded whilst parsing any part of the expression, then the
99090075Sobrientop-of-file test in @code{push_conditional} fails and the optimization
99190075Sobrienis turned off.
99290075Sobrien
99390075Sobrien@node Files
99490075Sobrien@unnumbered File Handling
99590075Sobrien@cindex files
99690075Sobrien
99790075SobrienFairly obviously, the file handling code of cpplib resides in the file
998169689Skan@file{files.c}.  It takes care of the details of file searching,
99990075Sobrienopening, reading and caching, for both the main source file and all the
100090075Sobrienheaders it recursively includes.
100190075Sobrien
100290075SobrienThe basic strategy is to minimize the number of system calls.  On many
100390075Sobriensystems, the basic @code{open ()} and @code{fstat ()} system calls can
100490075Sobrienbe quite expensive.  For every @code{#include}-d file, we need to try
100590075Sobrienall the directories in the search path until we find a match.  Some
100690075Sobrienprojects, such as glibc, pass twenty or thirty include paths on the
command line, so this can rapidly become time-consuming.
100890075Sobrien
100990075SobrienFor a header file we have not encountered before we have little choice
101090075Sobrienbut to do this.  However, it is often the case that the same headers are
101190075Sobrienrepeatedly included, and in these cases we try to avoid repeating the
101290075Sobrienfilesystem queries whilst searching for the correct file.
101390075Sobrien
101490075SobrienFor each file we try to open, we store the constructed path in a splay
101590075Sobrientree.  This path first undergoes simplification by the function
101690075Sobrien@code{_cpp_simplify_pathname}.  For example,
101790075Sobrien@file{/usr/include/bits/../foo.h} is simplified to
101890075Sobrien@file{/usr/include/foo.h} before we enter it in the splay tree and try
101990075Sobriento @code{open ()} the file.  CPP will then find subsequent uses of
102090075Sobrien@file{foo.h}, even as @file{/usr/include/foo.h}, in the splay tree and
102190075Sobriensave system calls.
102290075Sobrien
102390075SobrienFurther, it is likely the file contents have also been cached, saving a
102490075Sobrien@code{read ()} system call.  We don't bother caching the contents of
102590075Sobrienheader files that are re-inclusion protected, and whose re-inclusion
102690075Sobrienmacro is defined when we leave the header file for the first time.  If
102790075Sobrienthe host supports it, we try to map suitably large files into memory,
102890075Sobrienrather than reading them in directly.
102990075Sobrien
103090075SobrienThe include paths are internally stored on a null-terminated
103190075Sobriensingly-linked list, starting with the @code{"header.h"} directory search
103290075Sobrienchain, which then links into the @code{<header.h>} directory chain.
103390075Sobrien
103490075SobrienFiles included with the @code{<foo.h>} syntax start the lookup directly
103590075Sobrienin the second half of this chain.  However, files included with the
103690075Sobrien@code{"foo.h"} syntax start at the beginning of the chain, but with one
103790075Sobrienextra directory prepended.  This is the directory of the current file;
103890075Sobrienthe one containing the @code{#include} directive.  Prepending this
103990075Sobriendirectory on a per-file basis is handled by the function
104090075Sobrien@code{search_from}.
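
The shape of this list, and the choice of starting point, can be
sketched as below.  The names @code{struct search_dir},
@code{search_start} and @code{chain_length} are hypothetical; in
particular @code{search_start} is an illustration of the idea and not
the actual @code{search_from} function.

```c
#include <stddef.h>

/* One directory in the null-terminated, singly-linked search chain.  */
struct search_dir {
    const char *name;
    struct search_dir *next;
};

/* Return the first directory to try for an #include.

   QUOTE_CHAIN is the head of the "header.h" chain, whose last node links
   into BRACKET_CHAIN, the head of the <header.h> chain.  For the quoted
   form, the caller-owned node HERE is filled in with CURRENT_DIR (the
   directory of the including file) and prepended to the whole chain.  */
struct search_dir *
search_start(struct search_dir *quote_chain,
             struct search_dir *bracket_chain,
             struct search_dir *here, const char *current_dir,
             int angle_brackets)
{
    if (angle_brackets)
        return bracket_chain;
    here->name = current_dir;
    here->next = quote_chain;
    return here;
}

/* Number of directories that would be tried starting from D.  */
size_t chain_length(const struct search_dir *d)
{
    size_t n = 0;
    for (; d != NULL; d = d->next)
        n++;
    return n;
}
```

Because the extra node is owned by the caller, the shared chain itself
is never modified when a different file issues the next
@code{#include}.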
104190075Sobrien
104290075SobrienNote that a header included with a directory component, such as
104390075Sobrien@code{#include "mydir/foo.h"} and opened as
104490075Sobrien@file{/usr/local/include/mydir/foo.h}, will have the complete path minus
104590075Sobrienthe basename @samp{foo.h} as the current directory.
104690075Sobrien
104790075SobrienEnough information is stored in the splay tree that CPP can immediately
104890075Sobrientell whether it can skip the header file because of the multiple include
104990075Sobrienoptimization, whether the file didn't exist or couldn't be opened for
105090075Sobriensome reason, or whether the header was flagged not to be re-used, as it
105190075Sobrienis with the obsolete @code{#import} directive.
105290075Sobrien
105390075SobrienFor the benefit of MS-DOS filesystems with an 8.3 filename limitation,
105490075SobrienCPP offers the ability to treat various include file names as aliases
105590075Sobrienfor the real header files with shorter names.  The map from one to the
105690075Sobrienother is found in a special file called @samp{header.gcc}, stored in the
105790075Sobriencommand line (or system) include directories to which the mapping
105890075Sobrienapplies.  This may be higher up the directory tree than the full path to
105990075Sobrienthe file minus the base name.
106090075Sobrien
1061169689Skan@node Concept Index
1062169689Skan@unnumbered Concept Index
106390075Sobrien@printindex cp
106490075Sobrien
106590075Sobrien@bye