
The GPC Source Reference

"The Source will be with you. Always."

This section tells you how to look up additional information about the GNU Pascal compiler from its source code. It replaces chapters like "syntax diagrams" you probably know from the documentation of other compilers.

Proprietary compilers often come with a lot of technical information about the internals of the compiler. This is necessary because their vendors want to avoid distributing the source of the compiler--which is always the most definitive source of this technical information.

With GNU compilers, on the other hand, you are free to get the source code, look at how your compiler works internally, customize it for your own needs, and redistribute it in modified or unmodified form. You may even charge money for this redistribution. (For details, see the GNU General Public License, section GNU GENERAL PUBLIC LICENSE.)

The following subsections are your guide to the GNU Pascal source code. If you have further questions, feel free to ask them on the GNU Pascal mailing list (see section Where to get support for GNU Pascal).

All file paths mentioned in this chapter are relative to the GNU Pascal source directory, usually a subdirectory `p' of the GCC source. For instance, `parse.y' means `/usr/local/src/gcc-2.8.1/p/parse.y' (or wherever you have installed the GCC source on your machine); `../tree.def' means `/usr/local/src/gcc-2.8.1/tree.def', etc.

(Under construction.)

For more information, see chapters "Portability" through "Fragments" in "Using and Porting GNU CC".

GPC's Lexical Analyzer

The source file `gpc-lex.c' contains the so-called lexical analyzer of the GNU Pascal compiler. (For those of you who know `flex': This file was not created using `flex' but is maintained manually.) This very first stage of the compiler is responsible for reading what you have written and dividing it into tokens, the "atoms" of each computer language. The source `gpc-lex.c' essentially contains one large function, `yylex()'.

This is, for example, where the real number `3.14' and the subrange `3..14' are distinguished, and where Borland-style character constants like `#13' and `^M' are recognized. This is not always a trivial task; for example, look at the following type declaration:

  Type
    X = ^Y;
    Y = packed array [ ^A..^B ] of Char;
    Z = ^A..^Z;

If you wish to know how GPC distinguishes the pointer forward declaration `^Y' and the subrange `^A..^Z', see `gpc-lex.c', function `yylex()', `case '^':' in the big `switch' statement.
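The simpler of the two cases mentioned above, `3.14' versus `3..14', can be illustrated with a small sketch. This is not GPC's code; the function `scan_number()' and the `num_kind' names are made up for illustration, and exponents and other number formats are ignored. The idea is one character of extra lookahead: a `.' belongs to the number only if a digit follows it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds; the names are illustrative, not GPC's. */
enum num_kind { NUM_INTEGER, NUM_REAL };

/* Scan a number starting at s (s[0] must be a digit).  Stores the number
   of characters consumed in *len and returns the kind of number found.
   A '.' is part of the number only if a digit follows it; in "3..14" both
   dots are left for the next token, the subrange operator. */
enum num_kind scan_number(const char *s, size_t *len)
{
    size_t i = 0;
    enum num_kind kind = NUM_INTEGER;

    assert(isdigit((unsigned char)s[0]));
    while (isdigit((unsigned char)s[i]))
        i++;
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        kind = NUM_REAL;
        i++;                            /* consume the decimal point */
        while (isdigit((unsigned char)s[i]))
            i++;
    }
    *len = i;
    return kind;
}
```

Called on `3.14' this consumes four characters and reports a real number; called on `3..14' it consumes only the `3' and reports an integer.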

There are several situations where GPC's lexical analyzer becomes context-sensitive. One is described above; another example is the token `protected', a reserved word in ISO-10206 Extended Pascal, but an ordinary identifier in ISO-7185 Standard Pascal. It appears in parameter lists

  Procedure foo ( protected bar: Integer );

and says that the parameter `bar' must not be changed inside the body of the procedure.

On the other hand, if you write a valid ISO-7185 Standard Pascal program, you can declare a parameter named `protected':

  Procedure foo ( protected, bar: Integer );

Here the two standards contradict each other. GPC solves this problem by checking explicitly for `protected' in the lexical analyzer: If a comma or a colon follows, it is an ordinary identifier; otherwise it is a reserved word. With this rule, GPC even understands

  Procedure foo ( protected protected: Integer );

without losing the special meaning of `protected' as a reserved word.

The responsible code is in `gpc-lex.c'---look for `PROTECTED'.
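The lookahead rule for `protected' can be sketched in a few lines of C. This is only an illustration of the rule as described above, not the code from `gpc-lex.c'; the function `classify_protected()' and the `tok' names are invented here, and the sketch skips white space only, not comments.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical token kinds for this sketch. */
enum tok { TOK_IDENTIFIER, TOK_PROTECTED };

/* rest points just behind the word "protected" in the input.  Look ahead
   past white space: a comma or a colon means the word is an ordinary
   identifier, anything else makes it the reserved word. */
enum tok classify_protected(const char *rest)
{
    assert(rest != 0);
    while (isspace((unsigned char)*rest))
        rest++;
    return (*rest == ',' || *rest == ':') ? TOK_IDENTIFIER : TOK_PROTECTED;
}
```

In `protected protected: Integer' the first occurrence is followed by a letter and becomes the reserved word, while the second is followed by a colon and becomes an identifier.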

If you ever encounter a bug in the lexical analyzer, now you know where to hunt for it.

Language Definition: GPC's Parser

The file `parse.y' contains the "bison" source code of GNU Pascal's parser. This stage of the compilation analyzes and checks the syntax of your Pascal program, and it generates an intermediate, language-independent code which is then passed to the GNU back-end.

The bison language essentially is a machine-readable form of the Backus-Naur Form, the symbolic notation of grammars of computer languages. "Syntax diagrams" are a graphical variant of the Backus-Naur Form.

For details about the "bison" language, see the Bison manual. A short overview of how to pick up the information you might need for programming follows.

Suppose you have forgotten how a variable is declared in Pascal. After some searching in `parse.y' you have found the following:

  /* variable declaration part */

  variable_declaration_part:
            VAR variable_declaration_list semi
          | VAR semi
                  { error ("missing variable declaration"); }
          ;

  variable_declaration_list:
            variable_declaration
          | variable_declaration_list semi variable_declaration
                  { yyerrok; }
          | error
          | variable_declaration_list error variable_declaration
                  {
                    error("missing semicolon");
                    yyerrok;
                  }
          | variable_declaration_list semi error
          ;

Translated into English, this means: "The variable declaration part consists of the reserved word `Var' followed by a `variable declaration list' and a semicolon. A semicolon immediately following `Var' is an error. A `variable declaration list' in turn consists of one or more `variable declarations', separated by semicolons." (The latter explanation requires that you understand the recursive nature of the definition of `variable_declaration_list'.)

Now we can go on and search for `variable_declaration'.

  variable_declaration:
            id_list
                  {
                    [...]
                  }
            enable_caret ':' optional_qualifier_list type_denoter
            absolute_or_value_specification
                  {
                    [...]
                  }
          ;

(The `[...]' are placeholders for some C statements which aren't important for understanding GPC's grammar.)

From this you can see that a variable declaration in GNU Pascal consists of an "id list", followed by "enable_caret" (whatever that means), a colon, an "optional qualifier list", a "type denoter", and an "absolute or value specification". Some of these parts are easy to understand; the others you can look up in `parse.y'. Remember that the reserved word `Var' precedes all this, and a semicolon follows it.

Now you know how to obtain the exact grammar of the GNU Pascal language from the source.

The C statements, not shown above, are in some sense the most important part of the bison source, because they are responsible for the generation of the intermediate code of the GNU Pascal front-end, the so-called tree nodes. For instance, the C code in "type denoter" juggles for a while with variables of the type `tree', and finally returns (assigns to `$$') a so-called tree list which contains the information about the type. The "variable declaration" gets this tree list (as the argument `$6') and passes the type information to the C function `declare_vars()' (declared in `util.c'). This function `declare_vars()' does the real work of compiling a variable declaration.

This, the parser, is the place where it becomes Pascal.

Tree Nodes

If you really want to understand how the GNU Pascal language front-end works internally and perhaps want to improve the compiler, it is important that you understand GPC's internal data structures.

The data structures used by the language front-end to hold all information about your Pascal program are the so-called "tree nodes". (Well, it needn't be Pascal source--tree nodes are language independent.) The tree nodes are a kind of object, connected to each other via pointers. Since the GNU compiler is written in C and was created at a time when nobody really thought about object-oriented programming languages yet, a lot of effort has been taken to create these "objects" in C.

Here is an extract from the "object hierarchy". Omissions are marked with "..."; nodes in parentheses are "abstract": They are never instantiated and aren't really defined. They only appear here to clarify the structure of the tree node hierarchy. The complete list is in `../tree.def'; additional information can be found in `../tree.h'.

  (tree_node)
  |
  |--- error_mark  (* enables GPC to continue after first error *)
  |
  |--- (identifier)
  |    |
  |    |--- identifier_node
  |    |
  |    \--- op_identifier
  |
  |--- tree_list  (* general-purpose "container object" *)
  |
  |--- tree_vec
  |
  |--- block
  |
  |--- (type)  (* information about types *)
  |    |
  |    |--- void_type
  |    |
  |    |--- integer_type
  |   ...
  |    |
  |    |--- record_type
  |    |
  |    |--- function_type
  |    |
  |    \--- lang_type  (* for language-specific extensions *)
  |
  |--- integer_cst  (* an integer constant *)
  |
  |--- real_cst
  |
  |--- string_cst
  |
  |--- complex_cst
  |
  |--- (declaration)
  |    |
  |    |--- function_decl
  |   ...
  |    |
  |    |--- type_decl
  |    |
  |    \--- var_decl
  |
  |--- (reference)
  |    |
  |    |--- component_ref
  |   ...
  |    |
  |    \--- array_ref
  |
  |--- constructor
  |
  \--- (expression)
       |
       |--- modify_expr  (* assignment *)
       |
       |--- plus_expr  (* addition *)
      ...
       |
       |--- call_expr  (* procedure/function call *)
       |
       |--- goto_expr
       |
       \--- loop_expr  (* for all loops *)

Most of these tree nodes--struct variables in fact--contain pointers to other tree nodes. A `tree_list' for instance has a `tree_value' and a `tree_purpose' slot which can contain arbitrary data; a third pointer `tree_chain' points to the next `tree_list' node and thus allows us to create linked lists of tree nodes.

One example: When GPC reads the list of identifiers in a variable declaration

  Var
    foo, bar, baz: Integer;

the parser creates a chain of `tree_list's whose `tree_value's hold `identifier_node's for the identifiers `foo', `bar', and `baz'. The function `declare_vars()' (declared in `util.c') gets this tree list as a parameter, does some magic, and finally passes a chain of `var_decl' nodes to the back-end.
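To make the idea of a chain of `tree_list's concrete, here is a strongly simplified sketch in C. The struct `list_node' and the functions `chain_append()' and `chain_length()' are invented for this example; the real node layout and the real helper functions are in `../tree.h' and the back-end sources. Identifiers are plain strings here rather than `identifier_node's.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for a `tree_list' node (assumption: the real
   layout in ../tree.h is richer than this). */
struct list_node {
    const char *purpose;        /* unused in this example */
    const char *value;          /* the identifier's name */
    struct list_node *chain;    /* next list node, or NULL */
};

/* Append a new node holding value to the chain list; return the head. */
struct list_node *chain_append(struct list_node *list, const char *value)
{
    struct list_node *node = malloc(sizeof *node);
    struct list_node *p;

    assert(node != NULL);
    node->purpose = NULL;
    node->value = value;
    node->chain = NULL;
    if (list == NULL)
        return node;
    for (p = list; p->chain != NULL; p = p->chain)
        ;                       /* walk to the last node */
    p->chain = node;
    return list;
}

/* Count the nodes in a chain. */
int chain_length(const struct list_node *list)
{
    int n = 0;

    for (; list != NULL; list = list->chain)
        n++;
    return n;
}
```

Appending `foo', `bar', and `baz' in this order yields a chain of three nodes, which a function in the role of `declare_vars()' could then walk to create one declaration per identifier.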

The `var_decl' nodes in turn have a pointer `tree_type' which holds a `_type' node--an `integer_type' node in the example above. Having this, GPC can do type-checking when a variable is referenced.

For another example, let's look at the following statement:

  baz:= foo + bar;

Here the parser creates a `modify_expr' tree node. This node has two pointers, `tree_operand[0]' which holds a representation of `baz', a `var_decl' node, and `tree_operand[1]' which holds a representation of the sum `foo + bar'. The sum in turn is represented as a `plus_expr' tree node whose `tree_operand[0]' is the `var_decl' node `foo', and whose `tree_operand[1]' is the `var_decl' node `bar'. Passing this (the `modify_expr' node) to the back-end results in assembler code for the assignment.
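The shape of this expression tree can be sketched as follows. The struct `node', the `code' enumeration, and the function `eval()' are assumptions made for this example and do not match the real definitions in `../tree.h'. Note also that the real back-end generates assembler code from the tree; the sketch interprets the tree instead, simply to show how the operands are nested.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node codes, named after the tree codes in the text. */
enum code { VAR_DECL, PLUS_EXPR, MODIFY_EXPR };

struct node {
    enum code code;
    const char *name;           /* VAR_DECL only: the variable's name */
    int value;                  /* VAR_DECL only: its current value */
    struct node *operand[2];    /* expression nodes: the two operands */
};

/* Walk the tree recursively.  For MODIFY_EXPR, store the value of
   operand[1] into the variable in operand[0] and return it. */
int eval(struct node *t)
{
    switch (t->code) {
    case VAR_DECL:
        return t->value;
    case PLUS_EXPR:
        return eval(t->operand[0]) + eval(t->operand[1]);
    case MODIFY_EXPR:
        assert(t->operand[0]->code == VAR_DECL);
        t->operand[0]->value = eval(t->operand[1]);
        return t->operand[0]->value;
    }
    return 0;
}
```

For `baz:= foo + bar' one would build a `MODIFY_EXPR' node whose first operand is the node for `baz' and whose second operand is a `PLUS_EXPR' node with the nodes for `foo' and `bar' as operands.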

If you want to have a closer look at these tree nodes, write a line `(*$debug-tree="Foobar"*)' into your program with `FooBar' being some identifier in your program. (Note the capitalization of the first character in the internal representation.) This tells GPC to output the `identifier_local_value' tree node--the meaning of this identifier--to the standard error device in human-readable form.

While hacking and debugging GPC, you will also wish to have a look at these tree nodes in other cases. Use the `debug_tree()' function to do so. (In fact, `(*$debug-tree="Foobar"*)' does nothing more than call `debug_tree()' on the `identifier_local_value' of the `Foobar' identifier node.)

Parameter Passing

GPC supports a lot of funny things in parameter lists: `protected' and `const' parameters, strings with specified or unspecified length, conformant arrays, objects as `Var' parameters, etc. All this requires sophisticated type-checking; the responsible function is `convert_arguments()' in the source file `gpc-typeck.c'. Every detail can be looked up from there.

Some short notes about the most interesting cases follow.

Conformant arrays:
First, the array bounds are passed (an even number of parameters of an ordinal type), then the address of the array itself.
Procedural parameters:
These need special care because a function passed as a parameter can be confused with a call to the function whose result is then passed as a parameter. See also the functions `maybe_call_function()' and `probably_call_function()' in `util.c'.
Chars:
According to ISO-10206 Extended Pascal, formal char parameters accept string values. GPC does the necessary conversion implicitly. The empty string produces a space.
Strings and schemata:
Value parameter strings and schemata of known size are really passed by value. Value parameter strings and schemata of unknown size are passed by reference, and GPC creates a temporary variable to hold a copy of the string.
`Const' parameters:
If a constant value is passed to a `Const' parameter, GPC assigns the value to a temporary variable whose address is passed.
Typeless parameters:
These are denoted by `Var foo' or `Var foo: Void' and are compatible with C's `void *' parameters; the size of such entities is not passed. Maybe we will change this in the future and pass the size for `Var foo' parameters whereas `Var foo: Void' will remain compatible with C.
`CString' parameters:
GPC implicitly converts any string value such that the address of the actual string data is passed. However, GPC does not implicitly append a `chr ( 0 )' terminator, except for string constants.
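As an illustration of the conformant array convention in the list above (bounds first, then the address of the array), here is how a C function might receive such a parameter. This is a sketch under assumptions: the name `sum_conformant()' is invented, and the use of `int' for the bounds and elements is a guess for a one-dimensional array of `Integer'; check `convert_arguments()' for what GPC really passes.

```c
#include <assert.h>

/* C view of a Pascal routine declared roughly as
     Function Sum ( a: array [ lo..hi: Integer ] of Integer ): Integer;
   assuming the bounds arrive first, followed by the array's address. */
int sum_conformant(int lo, int hi, const int *a)
{
    int i, total = 0;

    assert(a != 0 && lo <= hi);
    for (i = lo; i <= hi; i++)
        total += a[i - lo];     /* element a[i] in Pascal terms */
    return total;
}
```

Since the bounds are ordinary parameters, the callee can translate Pascal indices into C offsets without knowing the array's declared type.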

GPI files--GNU Pascal Interfaces

This section documents how GPC transfers information from the exporting Modules and Units to the Program, Module or Unit which imports (uses) the information.

A GPI file contains a precompiled GNU Pascal Interface. "Precompiled" means in this context that the Interface already has been parsed (i.e. the front-end has done its work), but that no assembler output has been produced yet.

The GPI file format is an implementation-dependent (but not too implementation-dependent ;-) file format for storing GNU Pascal Interfaces to be exported--Extended Pascal and PXSC module interfaces as well as interface parts of Borland Pascal Units compiled with GNU Pascal.

To see what information is stored in or loaded from a GPI file, run GPC with an additional command-line option `--debug-gpi'. Then, GPC will write a human-readable version of what is being stored/loaded to the standard error device. (See also: section Tree Nodes.)

While parsing an Interface, GPC stores the names of exported objects in tree lists--in `gpc-parse.y', the bison (yacc) source of GPC's parser, search for `handle_autoexport'. At the end of the Interface, everything is stored in one or more GPI files. This is called in `gpc-parse.y'---search for `create_gpi_files'. (See also: section Language Definition: GPC's Parser, for an introduction to `gpc-parse.y')

Everything else is done in `gpc-module.c'. Here you can find the source of `create_gpi_files()' which documents the file format: First, a header of 33 bytes containing the string `GNU Pascal Module/Unit Interface\n' is stored, then the name of the primary source file of the module as a string, then the name of the exported interface as a tree node (see below), and after that all exported names in the order in which they were stored while parsing.
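The 33-byte header makes it easy to recognize a GPI file. The following sketch checks a buffer for it; the function `has_gpi_header()' and the macro names are made up for this example and are not part of GPC. Only the header string itself is taken from the description above.

```c
#include <assert.h>
#include <string.h>

#define GPI_HEADER "GNU Pascal Module/Unit Interface\n"
#define GPI_HEADER_LEN 33

/* Return 1 if the first bytes of buf are the GPI header, else 0.
   buf holds the beginning of a file, len is the number of valid bytes. */
int has_gpi_header(const unsigned char *buf, size_t len)
{
    /* The string literal has 33 characters plus the terminating NUL. */
    assert(sizeof GPI_HEADER - 1 == GPI_HEADER_LEN);
    return len >= GPI_HEADER_LEN
        && memcmp(buf, GPI_HEADER, GPI_HEADER_LEN) == 0;
}
```

Everything behind the header (the source file name, the interface name, and the exported names) is not interpreted by this sketch.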

The names and the objects (i.e. constants, data types, variables and functions) they refer to are internally represented as so-called tree nodes as defined in the files `../tree.h' and `../tree.def' from the GNU compiler back-end. (See also: section Tree Nodes.) The names are stored as `identifier_node's, their meanings as `identifier_global_value's of these nodes. The main problem when storing tree nodes is that they form a complicated tree in memory with a lot of circular references making it hard to decide which information must be stored and which mustn't.

The functions `load_tree()' and `store_tree()' load/store a tree node with the contents of all its contained pointers in a GPI file.

Each tree node has a `tree_code' indicating what kind of information it contains. Each different tree node must be stored in a different way. See the source of `load_tree()' and `store_tree()' for details.

Most tree nodes contain pointers to other tree nodes; therefore `load_tree()' and `store_tree()' are recursive functions. The `--debug-gpi' debugging information contains the recursion level in parentheses, e.g. `loaded (2):' means that the loaded information was requested by a pointer contained in a tree node requested by a pointer contained in a tree node representing an exported symbol.

Since this recursion can be circular (think of a record containing a pointer to a record of the same type), we must resolve references to tree nodes which already have been loaded. For this reason, all nodes are recorded in a hash buffer--see `gpi-hash.c'.
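The following sketch shows why recording already-visited nodes ends the recursion. It is not the code from `gpi-hash.c': the names `seen_table', `seen_lookup()', and `store_node()' are invented, nothing is written to a file, and a linear array stands in for the hash table to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 64

/* Simplified node with a single pointer to another node. */
struct node {
    const char *name;
    struct node *ref;           /* may point back to an earlier node */
};

/* Table of the nodes stored so far (a hash table in the real compiler). */
struct seen_table {
    const struct node *nodes[MAX_NODES];
    int count;
};

/* Return the index of node in the table, or -1 if it is not there. */
int seen_lookup(const struct seen_table *t, const struct node *node)
{
    int i;

    for (i = 0; i < t->count; i++)
        if (t->nodes[i] == node)
            return i;
    return -1;
}

/* "Store" node and everything reachable from it; return the number of
   nodes newly stored.  A node that is already in the table would be
   written as a mere back reference, so the recursion stops there. */
int store_node(struct seen_table *t, const struct node *node)
{
    if (node == NULL || seen_lookup(t, node) >= 0)
        return 0;
    assert(t->count < MAX_NODES);
    t->nodes[t->count++] = node;
    return 1 + store_node(t, node->ref);
}
```

With a record node that points to a pointer node which in turn points back to the record, `store_node()' stores exactly two nodes and then stops, instead of looping forever.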

There are some special tree nodes (e.g. `integer_type_node' or `NULL_TREE') which are used very often. They have been assigned (normally invalid) unique `tree_code's, so they can be stored in a single byte.

That's it. Now you should be able to "read" GPI files using GPC's `--debug-gpi' option. If you encounter a case where the loaded information differs too much from the stored information, you have found a bug--congratulations! What "too much" means depends on the object being stored in or loaded from the GPI file. Remember that, across different recursion levels, things are loaded from a GPI file in the reverse of the order in which they were stored, but within the same recursion level they are loaded in the same order.

GPC's AutoMake Mechanism--How it Works

When a Program/Module/Unit imports (uses) an Interface, GPC searches for the GPI file (see section GPI files--GNU Pascal Interfaces) derived from the name of the Interface.

Case 1: A GPI file was found.

Each GPI file contains the name of the primary source file (normally a `.pas' or `.p' file) of the Module/Unit, and the names of all interfaces imported. GPC reads this information and invokes itself with a command like

  gpc foo.pas -M -o foo.d

This means: preprocess the file, and write down the name of the object file and those of all its source files in `foo.d'. GPC reads `foo.d' and checks whether the object file exists and whether the source was modified since the creation of the object file and the GPI file. If the object file is missing or out of date, GPC calls itself again to compile the primary source file. When everything is done, the `.d' file is removed. If there was no need to recompile, all interfaces imported by the Module/Unit are processed in the same way as this one.

Case 2: No GPI file was found.

In this case, GPC derives the name of the source file from that of the Interface by trying first `interface.p', then `interface.pas'. This will almost always work with Borland Pascal Units, almost never with Extended Pascal Modules. With Extended Pascal, compile the Module once manually in order to produce a GPI file.

All this is done by the function `gpi_open()' which uses some auxiliary functions such as `module_must_be_recompiled()' and `compile_module()'.
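The up-to-date check described under case 1 boils down to a comparison of time stamps. Here is a minimal sketch of such a check. The function `needs_recompile()' and its parameters are assumptions made for illustration; the real logic lives in `module_must_be_recompiled()' and may differ in detail.

```c
#include <assert.h>
#include <time.h>

/* Return 1 if the module must be recompiled: either the object file is
   missing, or at least one source file listed in the `.d' file is newer
   than the object file or the GPI file. */
int needs_recompile(int object_exists, time_t object_time, time_t gpi_time,
                    const time_t *source_times, int n_sources)
{
    int i;

    assert(n_sources >= 0);
    if (!object_exists)
        return 1;
    for (i = 0; i < n_sources; i++)
        if (source_times[i] > object_time || source_times[i] > gpi_time)
            return 1;
    return 0;
}
```

A caller would fill in the time stamps from the file system (for example with `stat()') after reading the file names from the `.d' file.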

Each time an object file is compiled or recognized as being up-to-date, its name is stored in a temporary file with the same base name as all the other temporary files used by GPC but the extension `.gpc'. When the top-level `gpc' is invoked (which calls `gpc1' later on), it passes the name of this temporary file as an additional command line parameter to `gpc1'. After compilation has been completed, the top-level `gpc' reads the temporary file and adds the new object files to the arguments passed to the linker.

The additional command-line option `--amtmpfile' (not to be specified by the user!) is passed to child GPC processes, so all compiles use the same temporary file.

The source for this is mainly in `gpc-module.c', but there are also some hacks in `gpc.c', additional command line options in `lang-options.h' and `gpc-decl.c', and `gpc-defs.h' contains declarations for the functions and global variables.

