Compile-Time Error Handling Part 3

Secondary Error Recovery

Some error-recovery strategies, especially ad hoc strategies, are not always successful in localizing and recovering from an error. Such strategies may be prone to unbounded looping and may have to be limited in order to prevent them from consuming all available time or all of the source text. In such cases, it is necessary to have a secondary recovery scheme to fall back on when the usual system fails. Such a recovery scheme must be guaranteed to recover and must be certain to localize the error condition. The usual methods of performing the secondary recovery are panic mode (Graham and Rhodes, 1975) and unit deletion.

Panic mode is the process of throwing away source text until any of a number of firm delimiters is found. An example of a firm delimiter is the ‘;1 in PL/I. When such a delimiter is found, the stack is popped to a point where the left context it represents may be followed by the syntactic structure which follows the delimiter. In some simple parsers, panic mode is used as the only form of error recovery. It certainly succeeds in localizing a syntactic error, but it may throw away an arbitrary amount of source text, which is never checked for local validity. When an error is discovered in a PL/I statement, panic mode throws away source symbols until it finds a ‘;’, cleans off the stack so that the parser is ready to accept a statement, and resumes parsing after the semicolon.

Unit deletion refers to the deletion of the whole of a syntactic structure from the source text. Its effects are the same as those of panic mode, but since the whole of the syntactic unit (not just its end) is deleted, unit deletion preserves the syntactic correctness of the source program and is appropriate for error-repairing compilers. It is more difficult to achieve than is panic mode, since the translated form of the whole unit must be saved in the parser so that it may be easily deleted.

It should be noted that some of the formal methods for recovering from errors in syntax-directed parsers are able to recover from any error, and a secondary recovery is unnecessary.

Context-Sensitive Recovery

Sections 5-4.1 and 5-4.2 focused primarily on recovery from context-free errors. Context-free recovery has been more thoroughly studied than context-sensitive recovery because context-free parsing methods have been more adequately formalized. The checking of context-sensitive restrictions is usually done in an ad hoc fashion, and therefore, context-sensitive recovery is also ad hoc. Thus it is rather difficult to formulate a general theory of context-sensitive recovery.

The most efficient form of context-sensitive recovery is often to simply report the error, ignore it, and disable the code-generation portion of the compiler (since the program is no longer syntactically valid, translation may be impossible). Context-sensitive syntax is usually expressed as a set of restrictions on a larger context-free language. The parser is free to accept any member of the context-free language, giving messages and suppressing translation if the program does not fall within the smaller language defined by the context-sensitive restrictions. Context-sensitive errors usually have very local effects, and little need be done to recover from them. If the error involves undefined identifiers, it may be desirable to insert the identifiers into the symbol table, making assumptions about their types based on where they occur. This type of error recovery is discussed in more detail in Chap. 11.

Error recovery is the minimum acceptable response to syntactic errors. The most difficult problem for error recovery arises from the demands of error repair. Section 5-5 discusses error repair as an extension of error recovery, with more ambitious goals and more sophisticated techniques.

ERROR REPAIR

The advantages of syntactic error repair over error recovery have been discussed in Sec. 5-1. Error repair is a form of, error recovery which guarantees the syntactic validity of the intermediate form of the program which is passed to other parts or passes of the compiler. Thus the compiler may perform complete checking and translation of the correct portions of the original program.

Section 5-2.3 mentioned the minimum-distance repair algorithm devised by Lyon (1974) which takes considerable pains in attempting to achieve a good repair, that is, a repair which is often a correction. It must be emphasized that even this expensive algorithm cannot always correct errors. A compiler does not have enough information to determine what is correct, in terms of what the programmer intended, and no present-day compiler is sophisticated enough to make use of this information if it were available.

The general outline of error recovery in Sec. 5-4 allowed modifications to the parse stack, symbol tables, and the unread input string during the process of recovery. When repair is required, modifications to the left context become quite undesirable. The left context represents accumulated information concerning source text which has already been translated and is no longer available. If stack and table modifications are to be performed, the repair system must be able to retract much of the processing already done on the erroneous part of the program. This means added complexity and space for storing parts of the source and its translation.

Fortunately, parsing algorithms which have the valid prefix property (see Sec. 5-2.3) need not tamper with the parse stack in order to recover or repair. LR (Sec. 7-4), LL (Sec. 6-2), and most top-down, no-backup ad hoc parsers have this property and are able to recover or repair by inserting or deleting source text.

Ad Hoc Repair

All the discussion in Sec. 5.4-1 about ad hoc recovery applies to ad hoc repair as well, since error repair is a sophisticated means of error recovery. The methods available for invoking the error-handling routines are unchanged. The error-handling routines must now be coded in such a way as to produce a syntactically valid output for the rest of the compiler. The advantages and disadvantages of the ad hoc repair methods parallel the advantages and disadvantages of ad hoc error recovery.

Syntax-Directed Repair

A syntax-directed repair mechanism is one which is general to a class of languages and which can be generated by a formal algorithm. This section outlines a syntax-directed error-repair strategy for a top-down, no-backup parser (see Sec. 2-4.2). The design is adapted from Holt and Barnard (1976) and uses a cost vector to allow the compiler writer to tune the error repair (see Sec. 5-4.2).

The top-down parser attempts to build a syntax tree describing the source program. It does this by repeatedly replacing the leftmost nonterminal symbol in the tree with the right-hand side of one of its productions. An error is detected when the downward growth of the syntax tree requires a particular terminal symbol to appear in the input string, and that terminal symbol is not present. This parser has the valid prefix property, since it detects a syntactic error wheii the first erroneous symbol is encountered.

The error-repair strategy is a relatively simple one. When the syntax tree demands that a particular terminal occur in the input string, and the terminal does not occur, the repair algorithm inserts the required terminal into the source text (input string). In some cases the syntax tree may require one of a number of terminal symbols, none of which occurs in the input string. In this case, an insertion-cost vector, which specifies a cost for inserting each symbol of the grammar, is used to determine which of the required terminals should be inserted. The repair algorithm attempts to minimize the cost of the insertion. The input symbol which caused the error is discarded unless the inserted terminal symbol is a special symbol (not an identifier or constant) and the input symbol is not a special symbol.

Notice that the repair action just described need not actually remove the error. If the input symbol is retained, it is possible that it may not validly follow the inserted symbol. In such a case the repair system must perform another insertion. The repair system should check for this condition before returning to the parser in order to prevent the parser from reporting the error again. Furthermore, the repair system must contain limits which prevent it from entering an unbounded sequence of insertions which never correct the error (whether or not this is possible depends on the grammar).

Because this simple repair strategy may fail to repair the syntactic error, Holt and Barnard provide a three-level repair system. If a statement or line boundary is encountered in the syntax tree or in the input string during a repair, a line-level repair strategy (akin to the unit deletion of Sec. 5-4.3) is invoked to generate or delete a line of source text. If this strategy also fails, and an end-of-program delimiter is encountered, the highest level of repair takes over and generates the rest of the program or deletes the rest of the input string.

When the inserted terminal symbol is an identifier or a constant, there are several possible choices of identifiers or constants to insert. Based on the information in the symbol table concerning identifier scope and type, it might be possible to insert an identifier which can validly appear in this context. In some cases there may be only one such identifier. If there are several identifiers which can appear at a given point in the source text, the choice must be arbitrary. The choice of a constant to insert is nearly always arbitrary, except where the context restricts the range of values which may appear (in array subscripts, for example). Holt and Barnard (1976) always insert 1 for a numeric constant, "?" for a string constant, and $NIL for an identifier. An arbitrary choice of one of the possible identifiers may sometimes yield a correction of the error; $NIL is never a correction, only a repair. In order to prevent context-sensitive errors due to the attributes of $NIL, there should be a universal type attribute which allows an identifier like $NIL to be an operand of any operator in the language. Care must then be taken by the code generator to produce some sort of reasonable translation for this inserted identifier.

Context-Sensitive Repair

Unlike a simple recovering compiler (Sec. 5-4.3), an error-repairing compiler cannot ignore context-sensitive errors. The intermediate form of the program which is passed to other parts of the compiler must contain operands with correct attributes in every expression. When the attributes of some operand do not match those required by a particular operator, an operand of the required type must be substituted for the invalid operand. The comments of the previous paragraph apply here. A dummy identifier like $NIL with the universal type of attribute may be substituted for the invalid operand. Constants and expressions of incorrect type or value must be replaced by compiler-generated constants and expressions with the correct attributes. The most effective way to do this is usually clear from the characteristics of the error-detection routine itself. Since the context-sensitive restrictions are usually tested in an ad hoc manner, the repair strategy for each case can be tailored to and embedded in the error-detection routine.

Spelling Repair

The alphanumeric identifier plays a very large role in most programming languages. Identifiers are used to denote variables, constants, operators, labels, and data types. Many languages contain a class of special identifiers called keywords, which are used as syntactic delimiters. Typing errors and forgetfulness often lead to misspelled identifiers. Thus an identifier or incorrect keyword appears where a particular keyword should be found, or a keyword or undefined identifier appears where a particular variable identifier is intended. In such cases the syntactic error can often be repaired by means of the algorithm described in this section.

Morgan (1970) presents an algorithm which attempts to select a legal identifier which is close to the illegal identifier which occurs in the source program. If the algorithm succeeds, the repair consists of substituting the legal identifier for the illegal one.

Function SPELLING_REPAIR(SUBJECT). Given the incorrect identifier SUBJECT, this function invokes function COMPARE to determine whether two identifiers are similar enough that one could be a misspelling of the other. Local variables are i and j (strings), and I (set of strings).

Morgan indicates that "it is inefficient to attempt repair of variable names with fewer than three characters." This applies not only to variable names but to keywords and other identifiers as well.

Morgan’s algorithm uses a sub-algorithm COMPARE to test whether or not the SUBJECT is in some sense close to the identifier j. The compiler writer may choose any reasonable method for performing this comparison. The algorithm should return true if and only if SUBJECT may be obtained from j by means of some transformations which represent likely spelling mistakes. The particular COMPARE algorithm used by Morgan is based on Damerau (1964). This algorithm returns true if and only if SUBJECT may be obtained from j by one of the following transformations:

1. One symbol changed

2. One symbol deleted

3. One symbol inserted

4. Two adjacent symbols transposed

Damerau and Morgan report that single instances of one of these four transformations account for more than 80 percent of the spelling errors encountered in their computer installations. Based on a COMPARE algorithm which accepts only these transformations, step 1 of Function SPELLING_REPAIR may be modified to reject any identifier i whose length is different from the length of SUBJECT by more than one. In this section no such change has been made to Function SPELLING_REPAIR. The test for length difference is made in step 1 of Function COMPARE.

We now present Damerau’s version of the COMPARE algorithm.

Function COMPARE(S, T). This logical function has two string parameters S and T whose order does not matter. The string functions LENGTH and SUB are used to take the length of a string, and to access a substring of a string, respectively. The numeric functions ABS and MIN are used to take the absolute value of a number and to find the smaller of two numbers. Local variables are LS, LT, and j (integers).

The above function works only on single misspelled identifiers. It is possible that an identifier might be misspelled in such a way as to become two identifiers. For example, some symbol in the identifier might be changed to a blank or operator symbol, thus separating the identifier into two parts. Similarly, the first symbol might accidentally be made into a digit, changing the identifier into a number or a number followed by an identifier. Thus Function SPELLING_REPAIR does not repair all spelling errors, even within the limited set of transformations just described. However, owing to the layout of the keyboard and to the human mind, spelling errors usually involve changes and insertions of letters, not special characters.

This topic has presented an introduction to the problems and techniques of compile-time error handling. The detection and reporting of syntactic errors have been treated in depth. The introduction to error recovery and error repair in Sees. 5-4 and 5-5 provides the background needed to understand the discussions of detailed methods of error handling which appear in Chaps. 6, 7, 8, and 11.