C++ – What are the stages of compilation of a C++ program


Are the stages of compilation of a C++ program specified by the standard?

If so, what are they?

If not, an answer for a widely-used compiler (I'd prefer MSVS) would be great.

I'm talking about preprocessing, tokenization, parsing and such. What is the order in which they are executed and what do they do in particular?

EDIT: I know what compilation, linking and preprocessing do; I'm mostly interested in the others and the order. Explanations for these are, of course, also welcome, since I might not be the only one interested in an answer.

Best Answer

Are the stages of compilation of a C++ program specified by the standard?

Yes and no.

The C++ standard defines 9 "phases of translation". Quoting from the N3242 draft (10MB PDF), dated 2011-02-28 (prior to the release of the official C++11 standard), section 2.2:

The precedence among the syntax rules of translation is specified by the following phases [see footnote].

  1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. [SNIP]
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. [SNIP]
  3. The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters (including comments). [SNIP]
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. [SNIP]
  5. Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set; [SNIP]
  6. Adjacent string literal tokens are concatenated.
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. (2.7). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit. [SNIP]
  8. Translated translation units and instantiation units are combined as follows: [SNIP]
  9. All external entity references are resolved. Library components are linked to satisfy external references to entities not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

[footnote] Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.

As indicated by the [SNIP] markers, I haven't quoted the entire section, just enough to get the idea across.

To emphasize, compilers are not required to follow this exact model, as long as the final result is as if they did.
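A few of the early phases have effects you can observe from ordinary code. The following is only a small sketch, not anything taken from the standard: the names `GREETING`, `greeting`, `add_post_increment` and `demo` are made up for illustration. It touches line splicing (phase 2), the "longest possible token" rule used when forming preprocessing tokens (phase 3), macro expansion (phase 4) and string literal concatenation (phase 6).

```cpp
#include <cassert>
#include <string>

// Phase 2: the backslash-newline pair is deleted, so this macro definition
// is a single logical line by the time directives are processed in phase 4.
#define GREETING "hello, " \
                 "world"

// Phase 4 expands GREETING into two adjacent string literals; phase 6 then
// concatenates them with "!" into the single literal "hello, world!".
inline std::string greeting() { return GREETING "!"; }

// Phase 3: preprocessing tokens are formed greedily, so "a+++b" is split
// into a ++ + b, i.e. (a++) + b, and never into a + (++b).
inline int add_post_increment(int& a, int b) { return a+++b; }

// Hypothetical helper that checks the behaviour described above.
inline void demo() {
    assert(greeting() == "hello, world!");

    int a = 1;
    int c = add_post_increment(a, 2);
    assert(c == 3);  // old value of a (1) plus 2
    assert(a == 2);  // a was post-incremented
}
```

Calling `demo()` from `main` should run without tripping any assertion. With GCC or Clang, the `-E` option stops after preprocessing, which lets you inspect the already spliced and expanded source.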

Phases 1-6 correspond more or less to the preprocessor, phase 7 to what you might normally think of as compilation, phase 8 to template instantiation, and phase 9 to linking.
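To make the split between phases 7 and 8 a little more concrete, here is a minimal sketch. The names `square`, `nine` and `quarter` are hypothetical, and a real compiler is free to interleave this work as long as the result is the same.

```cpp
#include <cassert>

// Phase 7: the template definition is parsed and analyzed, but no code for
// any particular T exists yet.
template <typename T>
T square(T x) { return x * x; }

// These uses require the specializations square<int> and square<double>.
// Conceptually, phase 8 is where the required instantiations are produced.
inline int    nine()    { return square(3); }
inline double quarter() { return square(0.5); }

// An explicit instantiation definition requests square<long> even though
// nothing in this translation unit calls it.
template long square<long>(long);
```

If `square` were only declared here and defined in another file, the reference would be left for phase 9 to resolve, which is why a missing instantiation typically shows up as a linker error rather than a compiler error.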

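(C's translation phases are similar, but there are only eight of them: C has no templates, so there is no counterpart to C++'s phase 8, and linking is C's phase 8.)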