Compile-Time Error Handling Part 2

ERROR REPORTING

In order for effective error detection to be of any value to the users of a compiler, each error, once detected, must be clearly and precisely reported. To those who have never been involved in a large software project, it may not be obvious that error reporting requires special consideration. Small application programs may generate one of a dozen or so messages on the rare occasions when something goes wrong. Such programs can use the standard output routines or PRINT statements of the language in which they are programmed to print out a short message to the user. However, it is the premise of this entire topic that a compiler’s response to errors must be extremely well designed. Where a compiler is used primarily for program development, it may be called upon more often to generate good error messages than to generate good object code.

In the early days of high-level language translators when computing resources were scarce and the usefulness of the highly diagnostic compiler had not yet been demonstrated, it was common for a compiler to print error codes, rather than natural-language messages. When confronted with a message like ‘SY03′, the user would go to a topic of explanations provided by the authors of the compiler and would attempt to relate the description found there to the problem at hand. However, this proved inconvenient for usersvsince they always needed the error t to decipher the compiler’s cryptic messages. It also proved difficult to keep the multitude of error topics up to date as the compiler was (inevitably) modified. Furthermore, even when the description for ‘ SY031 was available and accurate, the user was often unable to understand just which objects in the program caused the message and why.

Today it is common for both educational and production compilers to generate natural-language messages. The generation of error codes in lieu of full messages is no longer justifiable, except on the smallest of microcomputers. One approach to generating natural-language messages is to produce error codes in the compiler, save these in a temporary file, and rely on a separate final pass to print the source listing and replace the error codes by the full text of the error messages. This approach is essentially an automation of the error-topic method and is probably the minimum acceptable standard of error reporting today. It removes the problems of the inavailability and inaccuracy of the error topics, but it does nothing to make the error descriptions more comprehensible.

The error messages already discussed contain no reference to the objects in the source program which are the cause of an error, and they often contain no information about the location of the error other than the number of the line on which it was detected. There are several ways in which we can improve on such error messages. Error messages should specifically mention any information known to the compiler which might be useful to the user in discerning the true cause of an error. Such information may include a visible pointer to the column at which the error was detected, the names and attributes of variables and constants associated with the error, the types involved in type-matching errors, and so on. All this information must be expressed in terms of the source language and must be oriented toward the user. In some cases it may be difficult for the compiler to provide this information, but it is usually easier for the compiler than for the user. The compiler writer should take the necessary pains to ensure that specific messages are printed to describe different errors, rather than relying on the possibly misinformed user to interpret one vague generic message which is used for several different errors. Messages such as "Bad statement syntax" are entirely unacceptable for any compiler which is to be used extensively.

Conway and Wilcox (1973) suggest an additional source of diagnostic information that is useful for compilers which perform effective error repair. They state that "the most effective communication rests not in the specific messages but in a reconstruction of the source statement …". When their PL/C compiler encounters an erroneous source statement, it states its interpretation of what the programmer intended to write, namely, the repaired form of the source statement. The repaired statement indicates what the error is in terms of how the statement can be made valid, and at the same time gives the user information helpful in the interpretation of later messages. For example, if the compiler reconstructs an invalid declaration in a way other than that intended by the programmer, later messages may be caused by the incorrect declaration. With the PL/C system of error reporting, the programmer can more easily understand the cause of the later errors.

Horning (1976) suggests that error messages should contain the insights about the language which the compiler writer is sure to have acquired during the process of implementation. An error message should contain suggestions about the cause of the error and how it might be fixed. This is the kind of suggestion made immediately by people experienced with a particular language and compiler whenever they see a certain error message. The major problem with this type of information in a syntax-directed compiler is that it is entirely empirical and cannot be generated by any sort of general algorithm. Thus it can be difficult to attach these personalized messages to a general-purpose syntax-directed parsing algorithm.

These elaborate and effective messages are considerably easier to propose than to implement. In the PL/C and PLUTO compilers for PL/I (Conway and Wilcox, 1973; Wagner, 1973; Boulton, 1975) the implementation approach has been the construction of a flexible message generator which interprets complex message patterns. The goal of this rather elaborate mechanism is to minimize the space consumed by the message patterns, while maximizing the flexibility, usefulness, and consistency of the error messages. The approach is a reasonable one, and we now present in some detail the design of a flexible message generator adapted from the message generators of PL/C and PLUTO.

Figure 5-1 Representation of an error-message-generation system.

Figure 5-1 depicts the layout of the two tables which control the generation of error messages. When an error is detected, the message generator is called and is given an ERROR # (error number) to indicate which message should be printed. The ERROR # is used as an index into the ERROR_TABLE. The row of the table selected by the index contains two fields. The LEVEL field indicates the level of severity of the error which caused the message. The MESGPTR field contains the index in MESG_TABLE of the message pattern for the error message.

The first action taken by the message generator is to compare the LEVEL of the message with the global variable MSGLEVEL, which indicates the lowest level of messages which are to be printed. The user is able to control the MSGLEVEL by means of a parameter of the compiler or by means of some construct in the language being compiled. For example, the user may easily suppress all warnings and mild error messages and have only the severe error messages printed. If the LEVEL of the message is less than the MSGLEVEL, the message is not printed, and the message generator returns. If the LEVEL is greater than or equal to the MSGLEVEL, the counts of total error messages and error messages with severity equal to LEVEL are incremented. Finally, the message generator interprets the message pattern indicated by MESGPTR, and prints the error message.

The message pattern consists of a string of bytes in the MESG_TABLE. Each pattern consists of a sequence of phrases. Each phrase begins with a letter, specifying what type of phrase it is, and may contain further bytes, depending on its type. The types of phrases are:

L- This letter indicates the literal text phrase and is followed by a byte containing the length of the literal text, and then by the literal text itself. M-This letter indicates a MESG_TABLE reference and is followed by a byte containing the index in MESG_TABLE of another message pattern to be interpreted before the remainder of the current pattern is interpreted. Thus the scheme is fully recursive, and a stack is necessary in the message generator to keep track of partially interpreted message patterns.The letter ‘ S’ introduces the suffix phrase, which is followed by a single character which is printed immediately after the preceding phrase, with no intervening blank or line feed. This feature may be used to form words such as ‘COMPILER’ and ‘COMPILED’ from roots such as ‘COMPILE’. T- This letter introduces the tab phrase and is followed by a byte containing the number of the column where the next phrase is to be printed. N- The letter ‘N’ represents the newline phrase, indicating that the next phrase is to be printed on a new output line. P- The 1P’ begins the parameter phrase and is followed by a byte containing the number of the string parameter of the message generator which is to be printed here. The routine which calls the message generator may pass it one or more string arguments containing the output forms of constants, identifiers, attribute names, and other information to be included in the message. A parameter phrase consisting of ‘P’ followed by the byte-sized integer 2 would indicate that the second of these arguments is to be printed. E- This letter represents the end phrase and acts as a terminator for the message pattern.

Figure 5-2 shows the layout of the error and message tables for a sample message. The error number passed to the message generator selects an entry in the ERROR_TABLE. That entry contains the MESGPTR value 4. The processing of the message begins with the interpretation of message pattern 4 in the MESG_TABLE. This pattern consists of five phrases, which are separated by blanks in the figure. The first phrase is ‘M’O, which causes the interpretation of message pattern 0. Message pattern 0 contains a phrase specifying a piece of literal text nine characters in length, namely, ‘IDENTIFIE’. The pattern is terminated by an ‘E’ phrase. After ‘IDENTIFIE’ is printed, the message generator resumes the interpretation of message pattera 4. The ‘ S” R’ phrase indicates that the suffix ‘R’ is to be placed on the preceding word. The ‘P’l phrase indicates that the first of the string parameters of the message generator call is to be printed. The ‘M’l phrase causes the interpretation of message pattern 1, which contains a literal text phrase for the word ‘ UNDEFINED1. The ‘E’ phrase signals the end of message pattern 4, and the message generator returns to its caller. If the first string parameter had the value ‘COUNT’, the complete message would be.

Figure 5-2 An error-message example.

IDENTIFIER COUNT UNDEFINED

The centralized message generator has several practical advantages. It promotes consistency in the form and content of error messages. Because several messages may have common submessages, the total space consumed by the message text is reduced. Without the central generator, there would be much redundant code associated with printing messages. Because the message-generation strategy is encapsulated in one module, it is easy to change. Furthermore, the centralized generator facilitates the collection of statistics concerning errors and messages, and increases the level of control which may be exercised over the error messages. For example, it becomes possible to count the number of error messages generated for a particular line and suppress further messages for that line if the count grows too large. If a large number of messages have been printed, the message generator may decide to suppress relatively unimportant messages, namely, comments and warnings. All of this is very difficult to do with a diffuse message-generation scheme.

Once a syntactic error has been properly reported, it is necessary to recover from it or repair it so that parsing may continue. Without this crucial step only one error can be detected in each compilation. Error recovery is the key to making a compiler truly useful in the debugging stage of program development.

ERROR RECOVERY

As shown in Sec. 5-1, a useful compiler must somehow be able to adjust its state after encountering an error so that the compilation may continue in a reasonable way through the remainder of the program. The difficulty in accomplishing this can be seen by comparing a parser to a pyramid delicately balanced on its apex. All goes well as long as there are no surprises for the parser. However, when a syntactic error is encountered in the source text, the pyramid begins to fall off balance. Depending on the parsing method used, the imbalance may be detected immediately or only after the pyramid has fallen into an even more unstable state. Attempting to ignore the error and blindly continue parsing can lead to the total collapse of the parse (and the metaphorical pyramid). The symptom of this is an avalanche of error messages caused by the original error and the increasingly unbalanced parser. The goal of any error-recovery method is to restore the parser to a stable state (rebalance the pyramid) so that the parse can continue in a meaningful way and further syntactic errors can be detected. Error recovery implies no attempt to change the erroneous program into an syntactically valid one.

A large number of different syntactic-analysis methods are presently in use. Therefore, it is difficult to make general comments about the error-recovery process. In the following discussion we develop a model of a syntactic analyzer and discuss error-recovery strategies for this model. More specific and detailed recovery methods for the commonly used syntactic-analysis techniques are discussed in the next two topics.

Many of the most popular parsing algorithms have the following elements:

1. A pushdown stack, used to store the sentential form being constructed by the parser.

2. An input string av a2,…, an containing the tokens of the source program.

3. One or more symbol tables, used to keep track of the attributes of variables and the identifiers which refer to them. These tables are used by the parser to check the context-sensitive syntax of the program.

In this model a syntactic error is detected when the current input symbol a, is invalid in the context in which it appears. In a left-to-right analysis the stack and the symbol tables represent the left context of the error and the unprocessed input string al + l, al + 2,…, an contains the right context of the error. Recovery is performed by modifying the stack and tables (left context), or the input string (right context), or both, so that the parser configuration is legal.

Ad Hoc Recovery Mechanisms

The approach taken to error recovery in most high-performance industrial compilers is an informal one. The portion of the compiler which detects an error calls some special-purpose error-recovery routine which is able to handle a particular error or class of errors. The action taken by the routine is decided by compiler writers when they code the routine. In this way compiler writers may anticipate a particular class of syntactic errors and recover from them in whatever way they consider most reasonable and effective.

The ad hoc recovery scheme works well as long as the syntactic errors encountered fall into one of the classes of errors anticipated by the compiler writer. Unfortunately, it is very difficult to anticipate all syntactic errors. Ad hoc recovery methods have a disadvantage in that they tend to react badly to unanticipated situations and may produce avalanches of error messages. This is related to the problem of ensuring that the compiler has a correct, bounded, and stable reaction to errors, as discussed in Sec. 5-1. A further disadvantage of ad hoc error-recovery methods is the large amount of effort which must go into the writing of the error-recovery routines and the careful anticipation of all possible syntactic errors.

It is possible to use ad hoc error-recovery techniques in a syntax-directed or table-driven parser. As mentioned in Sec. 5-2.2, errors are usually detected in table-driven parsers when the indicated parse action in the parsing table is error. To add ad hoc error recovery to such a parser, it is only necessary to change the error actions in the table into calls to hand-coded error-recovery routines. Thus a particular error or group of errors causes a particular routine to be called. This routine then modifies the stack, symbol tables, and/or input string to place the parser into a stable state again. This is roughly the approach used in the PL/C compiler for PL/I (Conway arid Wilcox, 1973).

The primary advantage of ad hoc recovery methods is that they allow compiler writers to put their intuitions about common syntactic errors into the error-recovery scheme. Instead of specializing the recovery algorithm, the compiler writer may add this information to the grammar in the form of error productions. Error productions are special productions which take effect only when a syntactic error is encountered. The parser and error-recovery system can remain general, while language-specific information is incorporated by means of the error productions. Wirth (1968) uses a table of error productions in his simple-precedence parser for PL360 to select an appropriate error message when an error occurs—the table is searched for a right-hand side which matches the top elements of the stack, and the message attached to this right-hand side is printed. The erroneous phrase matched by the error production is then discarded. Aho and Johnson (1974) use a special error token in their LALR(l) grammars to indicate to the parser a way of recovering from a syntactic error. When an error is detected, the parser looks through the input string until it finds some symbol which can legally follow error in the grammar. It then adjusts the stack and proceeds as if it had read the tokeh error in the input string.

The disadvantages of error productions as a means of recovery are similar to those of other language-dependent methods. Considerable time must be spent inserting error productions and checking that they work correctly. There is always the danger that some error situation will be overlooked and that the parser will react badly in unanticipated situations. Depending on the tabular form used for the grammar, it is possible that the added error productions may make the grammar prohibitively large and impossible to store.

Syntax-Directed Recovery

In view of the possibilities for oversights and mistakes in hand-coding an ad hoc error-recovery mechanism, it appears desirable to have some formalized general approach to generating an error-recovery mechanism. Much research has been devoted to the development of algorithms which operate on the data structures of context-free parsers of various types so as to adjust them to proceed after a context-free syntactic error. This section describes a general approach to syntax-directed error recovery (Rhodes, 1973; Graham and Rhodes, 1975) which is applicable to any bottom-up syntax-directed parser.

An error-recovery strategy can usually make use of only a small part of the context of the error. In the Graham-Rhodes method the error recovery takes place in two phases. The first phase, ca)led the condensation phase, attempts to parse the sentential form on the stack and in the unreal portion of the input string, so that as much of the context of the error as possible may be used in the second phase of the recovery. First, a backward move is attempted. This consists of an attempt to perform reductions on the sentential form on the parser’s stack. A reduction consists of the replacement of the right-hand side of some production of the grammar by the left-hand side of that production. When all possible reductions on the top of the stack have been performed, the recovery mechanism attempts a forward move, which consists of parsing the portion of the input string immediately following the error. The forward move stops when it encounters a second error in the input string or when it cannot make any more parsing actions without using the context preceding the point of the error. An arbitrary amount of the unread input string may be consumed by the forward move, but since all of this input is parsed and checked for syntactic errors, none of the source text is actually ignored.

To illustrate the backward and forward moves, consider the following productions from an example grammar:

and the erroneous ALGOL program segment:

In this example, there is a missing semicolon after the statement "A:= B — 2." Let us assume a bottom-up parser without worrying about the details of the parsing strategy. When the missing semicolon is detected, the stack could be

with B as the next input token. A backward move would reduce the stack to

The error point represented by ? is shifted onto the stack, and parsing continues with the forward move until the point

is reached. At this point, a reduction over the error point is required and the stack contents are passed to the second phase. In summary, the backward and forward moves attempt to capture the context surrounding where the error is detected.

The second phase of the error recovery performs a pattern-matching operation on the top elements of the parsing stack in an attempt to find a production in the grammar whose right-hand side is similar to the symbols on the stack. The top part of the stack is then adjusted to match the right-hand side of the production or the beginning of the production. The next input symbol is also considered in the choice of the change made to the stack.

In the previous example, the insertion of a semicolon at the error point permits the reduction of

The correction, however, is not necessarily so simple. Consider the incorrect program segment

In this program the token "IF" is missing. Assuming that the error is detected when "= " is read, the stack at this point would contain

In this case no backward move is possible. A forward move results in the following stack contents:

with the next input token of THEN. The forward move can no longer continue, and a reduction over the error point may be necessary. One possibility is to replace "=" by ":="; however, the input token "THEN," in this case, does not follow. Another possibility is to insert the token "IF" before the symbol (id), permitting a reduction to

This example illustrates that there may be more than one change which seems to permit a recovery from an error. In such instances, a choice can be made based upon the likelihood of which tokens are involved in syntax errors.

The major problem with syntax-directed recovery mechanisms is that they do not reflect the compiler writer’s knowledge of possible and plausible errors and their causes. The Graham-Rhodes method mitigates this somewhat by using a weighted minimum-cost criterion in choosing among the several stack changes which might be possible in some recovery situation. Compiler writers use their knowledge of the programming language to set up two vectors, containing the cost of deleting or inserting each symbol of the grammar. The cost of making a change represents the undesirability of making the change. The recovery routine attempts to minimize the total cost of all changes it makes to the stack. Thus the compiler writer can tune the recovery system to make it recover better in the most common error situations.