Symbol-Table-Handling Techniques (Compiler Writing) Part 1

Symbol tables (also called identifier tables and name tables) assist two important functions in the translation process: in checking for semantic (i.e., context-sensitive) correctness and aiding in the proper generation of code. Both of these functions are achieved by inserting into, and retrieving from the symbol table, attributes of the variables used in the source program. These attributes, such as the name, type, object-time address, and dimension of a variable, are usually found explicitly in declarations or more implicitly through the context in which the variable name appears in the program. In this topic we do not describe how declarations, per se, are handled but describe the routines for organizing, creating, and searching symbol tables. It is these routines which are invoked when declarations are handled at compile time. Declaration handling will be discussed in Chap. 11.

Table-handling techniques for both block-structured and non-block-structured languages are described in this topic.

PERSPECTIVE AND MOTIVATION

In topic 4, we described how the scanner identified different lexical classes for the constants, identifiers, reserved words, and operators of a language. Recall that identifiers were detected by first searching for a lexical unit, as isolated by the scanner, in a reserved-word table. If the lexical unit was found, its index value in the table was output by the scanner. This index value is used in subsequent parts of the compilation process to identify the reserved word that was isolated. If the lexical unit was not found in the table, it was assumed to be an identifier, and a token indicating this was output from the scanner.

Our discussion in this topic continues from Chap. 4 in the sense that we describe how the attributes of a variable are associated with a particular variable name (i.e., identifier) using a symbol table. It must be noted that a symbol table is generally a volatile table because entries are being added continually to and, in some cases, deleted from the table. While some of the techniques to be discussed in this topic, such as the binary search and hashing, can be applied to both static and volatile tables, maintaining a volatile symbol table is more difficult than maintaining a static table such as the reserved-word table.

The importance of a symbol table is best realized when we consider that every occurrence of an identifier in a source program requires some symbol-table interaction. Because of this continual interaction, symbol-table access may consume a major portion of the processor time during compilation. A study, cited by McKeeman (1974), indicated that for an efficient translator of a PL/I-like language XPL one-fourth of the translation time was devoted to interrogating the symbol table when a linear search method was employed. When the symbol-table access method was changed to a more efficient hash-table technique, the majority of the time previously expended in table accessing was saved. Hence we are motivated to study in some detail methods for efficient symbol-table handling.

This topic describes a number of symbol-table organizations for both nested and nonnested languages. Techniques involving ordered tables, tree-structured tables, and hash tables are presented. Routines for inserting and looking up table elements are discussed for each organization. These insertion and lookup routines are invoked when declarations are handled at compile time—precisely how and when they are invoked will not be considered in this topic but will be examined in Chap. 11 in the section dealing with declaration handling.

WHEN TO CONSTRUCT AND INTERACT WITH THE SYMBOL TABLE

The point in the translation process at which the symbol-table handling routines are invoked depends primarily on the number and the nature of the passes in the compiler. The following discussion is somewhat simplistic but nevertheless illustrates the reasons for this dependency.

In a multipass compiler, such as the one depicted in Fig. 8-1, the symbol table is created during the lexical-analysis (scanning) pass. Index entries for variables in the symbol table form part of the token string produced by the scanner. For example, in Fig. 8-1 when X and Y occupy symbol-table positions 1 and 2, respectively, a token string such as /1:= /2 + /I is created. The syntactic-analysis pass receives the token string, checks for syntactic correctness, and generates a parse tree or some encoded form of the parse tree.

Figure 8-1 Multipass compiler configuration.

This encoded form is then analyzed for semantic correctness (i.e., context-dependent correctness) and is used in the code-generation phase to generate a series of object-code instructions. The leaves of the parse tree contain indices into the symbol table. Note that no table-handling routines are used during the syqtactic-analysis phase. It is not until the semantic-checking and code-generatiop phases that many of the attributes associated with a variable can be assigned values in the symbol table. For example, in a« language with explicit declarations, the type of a variable can be assigned only when it is recognized that a declaration statement is being compiled. It is in the code-generation phase that the full context of the declaration (i.e., both the identifier name and its type) is known through the sequence of parsing actions performed by the syntactic analyzer.

Attempts can be made to assign attributes to the symbol table at other points in the translation process, most notably in the lexical-analysis phase. Such an approach, however, necessitates that soipe syntactic and semantic (i.e., context-sensitive) analysis for declarations be placed in the scanner. This strategy tends to produce a fragmented compiler in the sense that the functions of lexical analysis, syntactic analysis, and semantic analysis arp intermixed in several modules of the compiler. It is generally conceded that the intermixing of functions produces an unstructured compiler and is not conducive to good software-engineering practices (McKeeman, 1973).

A second approach to symbol-table handling is illustrated in Fig. 8-2. The lexical-analysis, syntactic-analysis, and code-generation phases are completed in one pass. As a result, it is possible in the code-generation phase to recognize that a declaration statement is being processed before the entire source statement is scanned. This helps immensely, since the attributes of a variable, as dictated by the declaration statement, can be placed in the table as they are identified during code generation.

Figure 8-2 Combined-pass compiler configuration.

Therefore, in the second approach, the only module which needs to interact with the symbol table is the code generator.

An exception to this rule occurs when certain context-dependent information is required during compilation to assist in parsing activities. For example, in many programming languages it is convenient to recognize, with a table lookup in the scanner, the type of a particular identifier. With this information the scanner can pass on to the parser a more meaningful token such as "real identifier" or "integer identifier" rather than just "identifier." This strategy has two benefits. It reduces the complexity of the grammar for the parser (certain conflicts generated by using the general syntactic construct "identifier" can be resolved), and it allows a more systematic approach to error detection and recovery for context-sensitive types of errors.

In summary, it is suggested that in multipass compilers variable names should be inserted into the symbol table during lexical analysis and the other attributes of the variable should be assigned during code generation. If the lexical-analysis, syntactic-analysis, and code-generation phases are combined in one pass, all symbol-table interaction can be confined to the code-generation phase.

SYMBOL-TABLE CONTENTS

A symbol table is most often conceptualized as a series of rows, each row containing a list of attribute values that are associated with a particular variable. Figure 8-3 illustrates this conceptualization. The kinds of attributes appearing in a symbol table are dependent to some degree on the nature of the programming language for which the compiler is written. For example, a language may be typeless, and therefore the type attribute need not appear in the table. Similarly, the organization of the symbol table will vary depending on memory and access-time constraints. This point is exemplified throughout this topic.

	Variable Name	Address	Type	Dimension	Line Declared	Lines Referenced	Pointer
1	COMPANY#	0	2	» 1	2	9, 14,25	7
2	X3	4	1	0	3	12,14	0
3	FORM 1	8	3	2	4	36,37,38	6
4	B	48	1	0	5	10, 11,13, 23	1
5	ANS	52	1	0	5	11,23,25	4
6	M	56	6	0	6	17,21	2
7	FIRST	64	1	0	7	28, 29, 30, 38	3

Figure 8-3 Typical view of a symbol table.

The following list of attributes are not necessary for all compilers; however, each should be considered for a particular compiler implementation.

1. Variable name i

2. Object-code address

3. Type

4. Dimension or number of parameters for a procedure

5. Source line number at which the variable is declared

6. Source line numbers at which the variable is referenced

7. Link field for listing in alphabetical order

In the remainder of this section, each of the attributes is considered with respect to when it should be included and what problems arise in representing these attributes in a table.

A variable’s name must always reside in the symbol table, since it is the means by which a particular variable is identified for semantic analysis and code generation. A major problem in symbol-table organization can be the variability in the length of identifier names. For languages such as BASIC with its one- and two-character names and FORTRAN with names up to six characters in length, this problem is minimal and can usually be handled by storing the complete identifier (padded to the right with blanks if necessary) in a fixed-size maximum-length field. However, as pointed out in Chap. 3, such severe restrictions on variable names are unwarranted.

While there are many ways of handling the storage of variable names, two popular approaches will be outlined—one which facilitates quick table access and another which supports the efficient storage of variable names. To provide quick access, it is best to insist on a predefined, yet sufficiently large, maximum variable name length. A length of sixteen or greater is very likely adequate. The complete identifier can then be stored, left-justified, in a fixed-length field in the symbol table. With this straightforward approach, table access is fast but the storage of short variable names is inefficient.

A second approach is to place a string descriptor in the variable-name field of the table. The descriptor contains position and length subfields, as shown in Fig. 8-4. The pointer subfield indicates the position of the first character of the variable name in a general string area, and the length subfield describes the number of characters in the variable name. A table access must always go through the descriptor to the string area for variable-name matching. Therefore, this approach results in slow table access, but the savings in storage can be considerable.

In the compilation process, an object-code address must be associated with every variable in a program. This address dictates the relative location for values of a variable at run time. The object-code address is entered in the symbol table when a variable is declared (or first encountered). This address is recalled from the table when the variable is referenced in the source program. The address is then used in an object instruction that accesses (loads or stores) the value of that variable.

Figure 8-4 Using a string descriptor to represent a variable name.

For a language without dynamic storage-allocation requirements, such as FORTRAN, the object-code address is assigned in a sequential order starting at 1 and proceeding to m, where m is the maximum size of the data area allotted to a program. For block-structured languages, a 2-tuple address containing a block-level (BL) field and an occurrence number (ON) field is commonly adopted (see Randell and Russell, 1964). At run time the block-level value helps to locate the base of the data area allocated to the block in which the variable is declared. The occurrence number indicates the offset into the data area at which the value of the variable is stored. For now, it is not important to understand these object-code addressing schemes thoroughly—they will be elaborated upon in Chap. 9. This brief discussion should, however, aid iij the understanding of why it is important that the object-code address should be included as a symbol-table attribute.

The type attribute is stored in the symbol table when compiling languages having either implicit or explicit data types. Of course, for typeless languages such as BLISS (Wulf et al., 1971), this attribute is excluded. FORTRAN provides an example of what is meant by implicit data typing. Variables which are not declared to be a particular type are assigned default types implicitly (variables with names starting with I, J, K, L, M, or N are integers; all other variables are real). The type of the variable, whether it be assigned implicitly or explicitly, is most important for semantic checking. If a variable, say S, is declared to be of type string, then the expression "S*3.14" should be detected as erroneous (or at least a warning should be posted), since in most languages it is semantically inconsistent to multiply a string by a number. Numerous other examples of semantic-type checking will be discussed in Chap. 11. The type of a variable is also used as an indication of the amount of memory that must be allocated to a variable at run time. For example, if the type of a variable is integer, then a single word may be allocated in the data area; whereas, if the type is real, a double word may be required. Typically, the type of a variable is stored in the symbol table in an encoded form; that is, real is encoded as 1, integer is encoded as 2, etc.

The number-of-dimensions attribute and the number-of-parameters attribute are both important for semantic checking. In array references, the number of dimensions should agree with the number specified in the declaration of the array, and this agreement must be validated during semantic analysis. The number of dimensions is also used as a parameter in a generalized formula for calculating the address of a particular array element. In Chap. 9, we describe precisely how this is accomplished. The number of parameters in a procedure call must also agree with the number used in the procedure heading or declaration. In fact, in symbol-table construction it is convenient to view the number of parameters for a procedure as its number of dimensions and thus combine these two attributes into one. In addition to being convenient, this approach is also consistent, since the type of semantic checking for both attributes is similar.

In the example figures given in this topic simple variables (i.e., scalars) are considered to be of dimension 0, vectors of dimension 1, matrices of dimension 2, etc.

A very important programming aid that can be provided by a compiler is a cross-reference listing. This listing contains many of the attributes we have discussed already, plus the source line number at which a variable is declared (if explicitly declared) or first referenced (if implicitly declared) and the source line numbers of all other references to the variable. Figure 8-5 illustrates a typical cross-reference listing. Note that problems arise when trying to represent a list of all source line-number references. Generally some form of linked representation into an area separate from the table is adopted to handle this problem.

Name	Type	Dimension	Declared at	Reference
ANS	Real	0	5	11,23,25
B	Real	0	5	10, 11, 13, 23
COMPANY*	Int	1	2	9, 14, 25
FIRST	Real	0	7	28, 29, 30, 38
FORM1	Char	2	4	36, 37, 38
M	Proc	0	6	17, 21
X3	Real	0	3	12, 14

Figure 8-5 Example of a cross-reference listing

The final attribute, which we call the link field, is included simply to facilitate the production of a cross-reference listing that is ordered alphabetically by variable name. If a cross-reference listing is not to be included as one of the compiler aids, the link field as well as the source line-number attributes can be omitted from the symbol table.

Let us now look in more detail at the operations performed on a symbol table.

OPERATIONS ON SYMBOL TABLES

The two operations that are most commonly performed on symbol tables are insertion and lookup (also referred to as retrieval). The nature of each of these operations differs slightly depending on whether or not the language being compiled has explicit declarations.

For languages in which explicit declaration of all variables is mandatory, the two operations are invoked at clearly defined points in the compilation. It is obvious that an insertion operation is required when processing a declaration, since a declaration is intended to be an initial description of a variable’s attributes in the program. If the symbol table is ordered, say, alphabetically by variable name (such symbol-table organizations will be described in the next section), then an insertion operation may also involve a lookup to find the locations at which the variable’s attributes are to be placed. In such a situation an insertion is at least as expensive as a retrieval. If the symbol table is not ordered, the insertion operation is greatly simplified, since an initial lookup phase may be avoided; but the retrieval is expensive, since we may have to examine the entire table.

Retrieval operations are performed for all references to variables which do not involve declaration statements. The retrieved information (i.e.. the type, object-code address, number of dimensions, etc.) is used for semantic checking and code generation as indicated in the previous section. Retrieval operations for variables which have not been previously declared are detected at this stage and appropriate error messages or warnings can be emitted. Some recovery from such semantic errors can be achieved by posting a warning message and incorporating the nondeclared variable in the symbol table. The attributes associated with such variables would have to be deduced, as much as possible, through the context in which the variable is referenced.

When a programming language permits implicit declarations of variables, the operations of insertion and retrieval are closely linked. If declarations are not explicit and mandatory, any variable reference must be treated as an initial reference, since there is no way of knowing a priori if the variable’s attributes have been entered in the symbol table. Hence any variable reference generates a lookup operation followed by an insertion if the variable’s name is not found in the symbol table. All attributes associated with an implicitly declared variable must be deduced from the variable’s role in the program.

For block-structured languages, two additional operations, which we denote as set and reset, are required for symbol-table interaction. The set operation is invoked when the beginning of a block is recognized during compilation. The complementary operation, the reset operation, is applied when the end of a block is encountered. The nature of and the need for these operations can be illustrated by examining the program segment in Fig. 8-6.

In this figure, the variable X is declared in more than one block; in each of these blocks, X assumes different attributes. In a nested language, it is necessary to ensure that each instance of a variable name is associated with a set of unique table locations for storing the variable’s attributes. To handle this problem adequately, two types of adjustments—as realized in the set and reset operations are required in the symbol table.

Upon block entry, the set operation establishes a new subtable (within the symbol table) in which the attributes for the variables declared in the new block can be stored. Exactly how this new subtable is established is dependent upon the symbol-table organization.

Figure 8-6 Program segment from a block-structured language.

A number of different techniques for subtable creation will be discussed in Sec. 8-6.

Because a new subtable is established for each block, the duplicate variable-n^me problem can be resolved in the following manner. Assume a table-lookup operation is performed for a variable X and the subtables that are currently active are numbered 1,2,…,« according to the order in which they were created (i.e., subtable 1 was the first to be created, subtable n was the last to be created). If the search begins at subtable n and proceeds to subtable 1, then the desired variable, that is, the latest occurrence of X, is located first. In this manner, any ambiguity associated with duplicate names is resolved. Note that duplicate occurrences of a variable name in the same block are not allowed.

Upon block exit, the reset operation removes the subtable entries for the variables of the completed block. The removal of these entries reflects the fact that the variables in the terminated block can no longer be referenced. Again, the different methods for removing table entries are dependent on the symbol-table organization and hence are discussed in Sec. 8-6.

Now that the types of operations on a symbol table have been described, we are in a position to consider a number of different symbol-table organizations. In particular, an examination will be undertaken of how efficiently the symbol-table operations can be performed for each organization.

SYMBOL-TABLE ORGANIZATIONS FOR NON-BLOCK-STRUCTURED LANGUAGES

In this section, a number of symbol-table organizations are described for non-block-structured languages. By a non-block-structured language we mean a language in which each separately compiled unit is a single module that has no submodules. All variables declared in a module are known throughout the module.

Many of the organizations discussed in this section can be applied to block-structured languages also. These organizations along with some additional symbol-table methods for nested languages will be described in Sec. 8-6.

The organizations presented in this section proceed from the conceptually simple organizations, which generally are storage-efficient, to the more complex organizations, which provide fast table access. The primary measure which is used to determine the complexity of a symbol-table operation is the average length of search. This measure is the average number of comparisons required to retrieve a symbol-table record in a particular table organization. The collection of attributes stored in the table for a given variable will be called a symbol-table record. The name of the variable for which an insertion or lookup operation is to be performed will be referred to as the search argument.

Unordered Symbol Tables

The simplest method of organizing a symbol table is to add the attribute entries to the table in the order in which the variables are declared. Of course, if variables are declared implicitly, attributes are added according to the order in which variables are encountered during compilation.

Let us first examine the insertion and lookup operations for languages with explicit declarations. In an insertion operation no comparisons are required (other than checking for symbol-table overflow), since the attributes for a newly defined variable are simply added at the current end of the table. This procedure is somewhat idealistic, however, because it ignores the problem of duplicate variable-name declarations. To check for this type of error, a complete table search must precede the insertion operation. The lookup operation requires, on the average, a search length of (« + l)/2, assuming that there are n records in the table. This result can be derived easily by noting that the first record in the table is accessible in one comparison, the second record is accessible in two comparisons, and so on. Therefore, the expected length of search is

Of course, if the element is not found, an error has occurred and a recovery strategy, such as one of those discussed in the previous section, is necessary.

For a language with implicit declarations, an insertion operation must always be preceded by a lookup operation. In fact, it is the absence of a variable as determined by the lookup operation that indicates the variable’s attributes must be inserted in the symbol table. If this process requires a lookup followed by an insertion (in the case of a first occurrence of a variable), then n comparisons plus an insertion are necessary. If only a lookup is needed (in the case in which the variable was previously referenced), then an average of (n -I- l)/2 comparisons are used to locate the appropriate set of variable attributes. Therefore, if the ratio of first variable references to total variable references is denoted as p, the lookup and insertion process requires, on the average, comparisons.

An unordered table organization should be used only if the expected size of the symbol table is small, since the average time for lookup and insertion is directly proportional to the table size. In the subsections to follow we will look at some organizations which reduce the search time substantially.