Bottom-Up Parsing (Compiler Writing) Part 13

Error Detection and Recovery in LR Parsers

Acceptable error recovery, which often involves some error repair, has been difficult to incorporate within LR, SLR, and LALR parsers. This section outlines some of the approaches to error detection and recovery, concentrating on recent efforts in implementing error-repair strategies based upon the context surrounding the error point in a source program.

We first describe some of the earlier methods of error recovery which have been used. The discussion mentions some advantages and disadvantages of these methods, providing the motivation for other error-recovery implementations.

We next identify some of the difficulties in applying the Graham-Rhodes method of error recovery to LR parsers. We then present several approaches to error recovery that have been based on the Graham-Rhodes approach.

Early methods of error recovery. To be usable, a parser requires some method which it can use to recover from errors. This section examines three methods of error recovery which are applicable to LR parsers. While some of these methods have been used in parsers, they are not considered to provide satisfactory error recovery today. Recall from Chap. 5 that error recovery refers to an attempt to manipulate the stack configuration of the parser and the input stream so that parsing may continue. Error repair, a more stringent form of error recovery, involves changing the erroneous input sequence to one that is syntactically correct so that parsing may continue. Error-repair procedures tend to provide better diagnostics, giving the programmer hints as to how the parser recovered from the error and, hopefully, giving some hints on how to fix the error for a later run.


Error-recovery and -repair methods can be considered unsatisfactory for a number of reasons. Some methods react badly to unanticipated situations; some require a large programming effort and careful anticipation of possible syntactic errors. Because of increasing software-development costs, it has become advantageous to use parser generators to generate a parser. A few recovery methods cannot be included within a parser generator, requiring the parser writer to provide a significant portion of the coding effort. Finally some error-recovery methods are overly simplistic, throwing away stack and input contents until it appears that parsing may continue. They do not take advantage of available information which would lead to better error diagnostics and recovery.

One of the earliest methods of error recovery is referred to as panic mode. Panic mode has been popular because it is very simple to implement. Furthermore, the method is compatible with automatic parser generation and can be easily included in parsers generated by parser generators.

When an error is detected by a parser that uses panic mode for error recovery, the parser throws away input symbols and stack symbols until an input symbol and state on top of the stack permit parsing to continue. The usual approach to implementing panic mode involves defining a set of synchronizing symbols. These symbols are terminal symbols of the language that constitute firm delimiters. Usually, firm delimiters end statements or structured constructs and only minimally constrain the strings which may follow. PL/I. examples of synchronizing symbols are the semicolon and the keyword end. For most languages, the set of synchronizing symbols can be determined easily (Graham, 1975).

Panic mode proceeds as follows: Input tokens are thrown away until a synchronizing symbol is read. Then elements are popped off the parsing stack until a state that permits a move based on the synchronizing symbol ends up on top of the stack. Control is passed back to the parser once a parsing move can be performed. Some improvements to this approach make use of predictions as to what symbols may follow the one on the stack and throw away input symbols until one is found or insert symbols into the input stream until parsing can continue.

As an example of panic-mode error recovery, consider the following program statement:

tmp1A-29_thumb_thumb

There is a missing operator between the variables "c" and "</". After the terminal symbol for the variable "c" has been pushed onto the stack, an error is detected when the variable name "d" is read. This error-recovery method throws away source tokens until a token which would permit a move is read. Hence the tokens for "d", "(", "6", and ")" are discarded and parsing resumes with the multiplication operator that follows the character ")".

Table 7-45 Action table augmented with error routines

State

Action table (F)

tmp1A-30 tmp1A-31 tmp1A-32 tmp1A-33 tmp1A-34 tmp1A-35

0

tmp1A-36 tmp1A-37 tmp1A-38 tmp1A-39 tmp1A-40 tmp1A-41

1

tmp1A-42 tmp1A-43 tmp1A-44 tmp1A-45 tmp1A-46 tmp1A-47

2

tmp1A-48 tmp1A-49 tmp1A-50 tmp1A-51 tmp1A-52 tmp1A-53

3

tmp1A-54 tmp1A-55 tmp1A-56 tmp1A-57 tmp1A-58 tmp1A-59

4

tmp1A-60 tmp1A-61 tmp1A-62 tmp1A-63 tmp1A-64 tmp1A-65

5

tmp1A-66 tmp1A-67 tmp1A-68 tmp1A-69 tmp1A-70 tmp1A-71

6

tmp1A-72 tmp1A-73 tmp1A-74 tmp1A-75 tmp1A-76 tmp1A-77

7

tmp1A-78 tmp1A-79 tmp1A-80 tmp1A-81 tmp1A-82 tmp1A-83

8

tmp1A-84 tmp1A-85 tmp1A-86 tmp1A-87 tmp1A-88 tmp1A-89

9

tmp1A-90 tmp1A-91 tmp1A-92 tmp1A-93 tmp1A-94 tmp1A-95

10

tmp1A-96 tmp1A-97 tmp1A-98 tmp1A-99 tmp1A-100 tmp1A-101

11

tmp1A-102 tmp1A-103 tmp1A-104 tmp1A-105 tmp1A-106 tmp1A-107

The advantage of using panic mode is that this form of error recovery is fast and requires little code. However, because input symbols are thrown away, not all errors in a program may be detected, possibly necessitating several runs to detect all the errors. Finally, little information is provided to the user concerning the nature of the error.

A more ad hoc method of error recovery has been used in table-driven parsers, including LR parsers. Whenever an error entry in the action table would normally occur, the name of an error-recovery routine is designated. The error routines, which must be hand-coded, manipulate the input stream and the stack based upon the nature of the error (as indicated by which location is being accessed in the parse table).

Table 7-45 contains an example of an action table which is augmented with the names of error routines. These names begin with e and end with a digit. The error routine whose name is the one returned by the action table is called when an error occurs. For example, the error routine el handles the case of a missing operand, generating a suitable error message and inserting an operand into the input stream. Likewise, e2 might generate a message stating that the parentheses are unbalanced and would delete the extra right parenthesis.

The advantage of using error routines is that compiler writers are allowed to use their intuitions about the likely causes of syntax errors, and error recovery and meaningful diagnostic error messages can be provided. Because the size of LR parsing tables tends to be large for most programming languages, it is not feasible for these error routines to be implemented by hand. As this approach requires human insight to indicate the likely cause of an error, it is impossible to generate the error routines automatically. Also, unanticipated errors can cause the error-recovery process to collapse and cause the parser to become unstable.

Another approach which does provide automatic error recovery is used in YACC (yet another compiler-compiler), an LALR parser generator (Johnson, 1975). The user identifies "major" nonterminals that the parser uses as a basis for error recovery. The idea is similar to that of panic mode, where synchronizing symbols also provide a basis for recovering from erroneous input. Examples of major nonterminals include (program), (block), and (statement). The user adds to the grammar an error production of the form

tmp1A-108_thumb[2]

where A is a major nonterminal and a is an often empty string of terminal and nonterminal symbols.

When an error is detected by the parser, the error routine pops the stack until it finds an element whose state is associated with the production A —> error a and then shifts the special token error onto the stack as though error had been read as input. Then, the parser attempts to resume parsing by discarding input symbols until a valid input symbol is read. For example, a grammar might have the productions

tmp1A-109_thumb[2]

where (stmt) represents assignment statements, flow of control statements, and so forth. If an error is detected while a string of tokens which would reduce to (stmt) is being parsed, the parser first pops elements off the stack until a state which would permit the error token associated with the production

tmp1A-110_thumb[2]

to be shifted onto the stack is found. As error reduces to (stmt), input tokens are then discarded until a semicolon appears in the input text. At this point, parsing resumes.

The programmer can take advantage of intuitions concerning the likely causes of errors when including error productions in a grammar; if these error productions are improperly specified, the worst case for error recovery behaves like panic mode. Furthermore, YACC has demonstrated that this method can be implemented in parser generators. While this method permits automatic error recovery, its failings include the poor handling of unanticipated errors. Also, if no state on the stack is associated with an error production, the parser dies after popping the contents of the entire stack.

The error-recovery methods discussed here are not considered totally satisfactory. We next describe a better method of error recovery, which is based on the Graham and Rhodes method introduced earlier (see Sec. 7-3.5) for simple precedence parsers.

Application of the Graham-Rhodes method to LR parsers. Recall that the Graham-Rhodes method uses a two-phase approach. The first phase, or condensation phase, condenses the surrounding context by first trying to continue reducing the sentential form on the stack and then trying to continue to parse the unread portion of the input string without referring to what has been previously parsed before the error point. The second phase is the correction phase, which analyzes the context in which the error occurs and provides diagnostic information and a repair.

The condensation phase can further be subdivided into two parts. The first part tries to make further reductions on the stack preceding the point at which the error was detected. This is referred to as a backward move. The backward move endeavors to replace right-hand sides of some productions by their left-hand sides. Reductions continue until no more can be performed.

A forward move is then tried. In this case, parsing beyond the point of error detection is performed. The parser must be started up without any left context; in some cases, this is relatively simple (such as in simple precedence parsing). Parsing continues until one of two possibilities arise. In one possibility a second error occurs. One solution to resolving the second error is to recursively invoke the error-recovery scheme on the error in order to correct it; another solution tries to redo the parse using a different parsing sequence. The second possibility, which is the most likely according to Graham and Rhodes, occurs when a reduction that includes elements on the parsing stack that are below the point where the error was detected is required. The stack configuration is then passed to the correction phase.

The backward and forward moves attempt to summarize the surrounding context where the error is detected. The forward move provides an unbounded form of lookahead; while it consumes input, none of the input is ignored.

The correction phase attempts to exploit the information provided by the condensation phase in order to correct the error by considering changes to sequences of symbols rather than isolated changes to single symbols. The correction phase attempts to find a change to the stack so that the stack matches the right-hand side of some production in the grammar. Then a reduction over the error point is performed and the correction phase returns a valid correct stack configuration to the parser.

To determine which correction should be chosen, deletion and insertion costs can be associated with the terminal and nonterminal symbols of the language based on the likelihood that such a symbol might have been inserted or omitted from the language. Also, a substitution cost can be used which is based on the possibility that one symbol was accidentally used in place of another one. This cost is typically less than the sum of the deletion and insertion costs. An example where this is useful is by associating a low replacement cost of "= V for ":= " as the equality relation might often be used in place of the assignment operator.

The correction phase attempts to find all possible repairs and returns the one which is associated with the minimum cost.

The costs for the symbols can be based on certain heuristics. For example, brackets (i.e., do, end, ‘(‘, and ‘)’) and the nonterminals generating them (i.e., (blockbody)) should have relatively high deletion costs. Graham and Rhodes hold that the cost functions can be determined mechanically, but they do not give details as to how this is done (Graham and Rhodes, 1975).

If no correction can be found or it exceeds a maximum threshold value, the correction phase returns a failure. At this point, more global-context error recovery may be needed, but work on this is only in the initial stages (Pai and Keiburtz, 1980). One of the error-recovery methods discussed previously might be invoked only as a last resort.

The Graham-Rhodes method has provided the basis upon which some LR error-repair methods have been implemented. However, it is difficult to directly apply the method to LR parsers for a number of reasons which we discuss next. The remainder of the discussion presents error-recovery methods which try to circumvent these difficulties. In particular, we describe the methods by Mickanus and Modry and Pennello and DeRemer, which are applicable to LR, SLR, and LALR parsers.

Problems of implementing the Graham-Rhodes method Several authors have reported difficulties with the direct application of the Graham-Rhodes method to LR, SLR, and LALR parsers. This presentation tries to outline the problems which have been encountered and why these problems occur.

The total amount of context information concerning the contents of the stack which is relevant to continue parsing is given by the state which the parser is in, that is, the topmost state of the stack. By referencing this state and the current input symbol (assuming a single lookahead symbol parser), it is a simple matter to determine what the next move is. When an error occurs, it is difficult to perform reductions on the stack for the backward move because the parser can no longer determine which reductions are permissible. This is because lookahead LR parsers require the state on top of the stack and the current input token to determine the next move. As the input token cannot follow in the source string given the state on top of the stack, the action table indicates an error rather than indicating what reductions can be performed on the stack.

Recall that only LR and LL parsers always detect an error at the earliest opportunity. While SLR and LALR parsers never shift an erroneous symbol onto the stack, they may perform some reductions before detecting the error. Because the symbol is not a valid one given the preceding source string, some undesirable reductions may have been performed. The error-recovery strategy must take this into account. Handling both the backward move and the possibility of incorrect reductions is usually performed in the correction phase because they are more easily dealt with when repairs to a source program are being considered. The forward move is usually still done in the condensation phase. Incorrect reductions in the backward move are best considered when forward context is available and possible repairs are being weighed.

Another problem with applying the Graham-Rhodes method to LR parsers involves the forward move. In order to determine what the next state to be pushed onto the stack is, reference is made to the left context, using the current state on the top of the stack. With an error, no such reference can be made. To start the forward move, the next input symbol is shifted onto the stack, but the state to be associated with this symbol is difficult to determine. It cannot be the starting state, as this restricts the next valid input symbol to be only one of those which may begin the language. The usual approach is to push all the states which may be associated with this token onto the stack and then begin the forward move. Two different approaches to performing this type of forward move are discussed in the next two methods.

The error-recovery method of Mickanus and Modry Mickanus and Modry describe an error-recovery method (Modry, 1976; Mickanus and Modry, 1978) which is based quite closely on the Graham-Rhodes method and consists of a condensation phase and a correction phase. Because of the problems with implementing the backward move in LR parsers, it has not been included within the condensation phase.

We first present an overview of the error-recovery method. A somewhat detailed algorithm follows this overview.

In the following discussion the parsing stack is assumed to contain both state symbols and vocabulary symbols. Recall that the stack need not contain vocabulary symbols. Their inclusion, however, facilitates the explanations to follow.

Recall that an LR parser proceeds from one configuration to another by performing shift and reduce operations. Each of the operations alters the state on the parse stack. A configuration is composed of the parse stack concatenated with the unexpended input string. In the following discussion the configurations given do not usually include all the state symbols. While these state symbols would normally be stored in the stack, they are not included in order to improve the readability of the presentation and because the names of the states are frequently not important in the discussion.

An error is detected in an LR parser when the top state in the parse stack and the current input symbol indicates an error entry in the action table. At this point in the parsing process, no shift or reduce operation applies. An error configuration which consists of the current stack contents and the unexpended input string is passed to the error-recovery method. This method will either fail or produce a repair configuration that will permit the resumption of normal parsing.

In the forward move which constitutes the condensation phase, parsing from the error point continues until one of the following situations arises:

1. An attempt is made to perform a reduction over the error point.

2. A second syntax error occurs.

In general, there are many ways of condensing an error configuration. We must consider each possibility. An attempt to perform a reduction over the error point indicates that the configuration at that point is a correction candidate. This correction candidate can now be passed to the correction phase of the error-recovery method. If we encounter a second syntax error, however, this indicates that either there are actually two errors in the vicinity or the present condensation attempt fails. If all condensation attempts fail, however, the error-recovery method is invoked recursively to try to repair the second error. A holding candidate is used as input in this case.

The correction phase attempts to produce a repair configuration based on the application of the following operations:

1. Inserting a terminal symbol.

2. Reversing the parse, that is, changing the current configuration to a former configuration.

3. Deleting a terminal symbol (and its associated state) from the top of the stack. If the state on top of the stack is associated with a nonterminal symbol, the parse must be reversed until a terminal symbol (and its associated state) appears at the top of the stack.

4. Assigning a cost of each insertion and stack deletion. The total cost of the current attempted repair is also accumulated.

5. Abandoning the current attempted repair when its cost exceeds a preset threshold value and choosing some other alternative.

The correction candidate that is passed to the correction phase consists of a sequence of states which begins with the start state and leads up to the error point. A second sequence of states thaf is produced by the condensation phase starts from the other side of the error point.The simplest case is to connect these two sequences of states by inserting a terminal symbol, that is, performing a transition from state q to state p on a terminal symbol, say, x. If the gap between the two states cannot be bridged by a terminal insertion, an attempt is made to back up the error point. Such a move means that some parser actions may have been incorrect prior to the error point. Performing successive backups of the parse stack frees previously processed input symbols that can be used to extend the condensation-phase sequence. After each backup step, an attempt is made to connect the two sequences by inserting a terminal symbol. If at any time the sequence of condensation-phase states cannot be extended backward to incorporate a freed symbol, then that symbol is deleted. The backup prpcess fails if the parse retreats to the start state or too many symbols are deleted.

The details of the condensation phase just described are presented in the following subalgorithm.

Function CONDENSE (CONFIG). Given a set of parser configurations, CONFIG, this function performs a forward move on each configuration. CONDENSE returns to the calling algorithm the repair with the least cost or a null configuration (i.e., e) if no correction can be found. It tries all possible moves as described in the text, calling the function REPAIR on each forward move. REPAIR returns the cost of the repair it finds and, through its argument, returns the repaired stack configuration. The function LFLPARSE parses the configuration, performing the forward move, and returning ‘error’ if another error is detected or ‘reduce’ if a reduction over the error point is required. The function FIRST returns the first symbol from the string passed to it. SAVE_STACK and SAVE JNPUT store the state and remaining unexpended input for the minimum-cost repair found, respectively. FOUND_REPAIR is a logical variable which indicates whether a repair has been found. MIN contains the cost of the repair having the least cost while MAX gives the maximum permissible repair cost. The variable HOLD_CONFIG contains the set of holding candidates while the variable HOLDJNPUT saves the original unexpended input string. NEW_S contains the stack configuration upon which a repair is attempted. ‘?’ is a special token stored on the stack which indicates the point at which the error was detected. P is the set of states which may shift on the first element of the input string after the error point. Q is the set of parser states and the variable q stores parser states. The function SHIFT returns the set of terminal symbols which permit shift moves for the state passed to it. C is a parser configuration from the set CONFIG. The algorithm assumes that the configuration CONFIG is of the form (S, I), where S is the stack and I is the unexpended input string. The staek is treated as a string for notational convenience.

tmp1A-111_thumb[2]

 

 

tmp1A-112_thumb[2]

When an error is detected by the parser, it calls the function CONDENSE to perform the forward move. This function receives (via the parameter CONFIG) a parser configuration set. Initially, only one configuration is received from the first (i.e., main) invocation. After a forward move, CONDENSE calls another function, REPAIR, to find a repair for the error. Because several different repairs might be possible, CONDENSE returns the configuration with the minimum repair cost based on the costs of deleting and inserting symbols.

As it is difficult to determine what state the forward move should begin in, the algorithm computes every state which will shift the first character of the unexpended portion of the input onto the stack in step 3. For notational convenience the set SHIFT(q) is defined to be the set of symbols which permit shift moves where qeQ. These sets can be determined directly from the action table. For each of these states, step 4 calls a function LR_PARSE to parse the resulting stack configuration. In this step, the stack is initialized to permit a forward move. Note that the stack is treated as a string for notational convenience. The character "?" denotes the point in the stack where the error was detected. The function LR_PARSE returns to the calling function CONDENSE whenever another error occurs or a reduction over the error point is attempted. In the second case, the function REPAIR is called to find a repair and the repair cost. If this cost is the minimum one found so far, the configuration is saved.

In the case that LR_PARSE returns because another parsing error occurs, the stack configuration is saved as a holding candidate. If no repair can be found for those forward moves which lead to a reduction over the error point, CONDENSE is recursively called to try to correct the second error using the holding candidates. The loop at step 2 handles each holding candidate which was stored in the set of holding candidates. When CONDENSE is initially called by the parser, it is passed only one configuration; recursive invocations may pass several holding-candidate configurations.

Recall that m the correction phase, the stack consists of a sequence of states and symbols up to the error point and another sequence afterward. The correction phase tries to link the two together, first by inserting a terminal symbol. If that fails, the error point is backed up as it is assumed that the error might have occurred earlier in the stack. For example, consider the ewoneous program segment

tmp1A-113_thumb[2]

In this program the token" if’ is missing. An LR parser will not detect the error until the " = " is read. At this point the stack would be

tmp1A-114_thumb[2]

A forward move results in the configuration

tmp1A-115_thumb[2]

with the "then" as the next input symbol. The forward move can no longer continue, and a reduction over the error point may be necessary. One possibility is to insert ":"; however, the "then" seen in the input stream does not follow. Another possibility is to back up the parser. After the backup the insertion is tried again. However, if no backup can be made, the topmost symbol on the stack is deleted and if it is a nonterminal symbol, this is done by backing up the parse to the point just before the reduction to that symbol is made. In this case, the algorithm assumes that the error might have occurred in some part of the source program which had been reduced to that nonterminal symbol. When a backup of the error point is performed, it is impossible to tell from the stack what the previous state and input symbol are. Hence there are several possible states that are called rightstates.

The following subalgorithm performs the correction phase:

Function REPAIR(S). Given a stack configuration, S, of the formtmp1A-116_thumb[2] wheretmp1A-117_thumb[2] and Q is the set of parser states, this algorithm tries to repair the stack configuration. For clarity, it is assumed that the stack is in the above form whenever the loop of statement 2 is reexecuted. P and P’ are sets of rightstates. The variablestmp1A-118_thumb[2]are states on the stack S while 8 and

8′ are terminal symbols on the stack, y is a terminal symbol. The variable COST gives the cost of the repair; MAX contains the maximum permissible repair cost. The vectors INSERT and DELETE give the insertion and deletion costs for inserting and deleting terminal symbols. A(b) and D(b) are the sets of ancestors and descendents oftmp1A-119_thumb[2], respectively. The table G represents the goto table for the parser. The statement Recheck is used in the algorithm to cause a jump to the beginning of the innermost loop in which it is nested. If the condition of the loop evaluates to true, the loop is executed again. Note that P contains initially only one state, but as backup occurs, P is expanded to contain several states.

tmp1A-124_thumb[2]

 

 

 

tmp1A-125_thumb[2]

In step 3 of the algorithm the problem of inserting a symbol y involves more than just finding atmp1A-126_thumb[2]such that G(b, y) = a (assuming that the stack is of the form a’b'Xb?a/3 as described in the function’s preamble). Some reductions might be necessary on a’b'Xb before y can be shifted onto the stack. This problem can be resolved by examining not only the state b but also the states which are reachable from state b after a reduction. Thus the descendent states of the state must be considered. D(b) is defined to be those states r, such that if the parser is in state b, some reduction permits the parser to change to state r. By definition, tmp1A-127_thumb[2]. Similarly, when y is pushed onto the stack, some reductions involving y might be needed on the stack so that the parser ends up in state a. A(a) is defined to be the set of ancestor states r such that if the parser is in state r, some reduction permits the parser to change to state a. By definition,tmp1A-128_thumb[2]This means that we are looking for atmp1A-129_thumb[2]such that for sometmp1A-130_thumb[2]

If no repair can be found, it is possible that the error might have occurred earlier in the source program. Hence a backward move on the stack is needed which undoes the most previous shift or reduction on the stack. Step 4 computes the set of rightstates which are the possible states that might have been on the stack. Initially, a is the only member of the set.

If a backup cannot be performed, the symbol on top of the stack is deleted in step 5. If this symbol is a terminal, deletion is simple. Furthermore, if a descendent state of the state on top of the stack after it is popped is also in the set of rightstates, a repair has been found. If not, execution returns to step 3 to see if a replacement for the deleted symbol can be found. On the other hand, if the symbol to be deleted is a nonterminal, the stack is reparsed from a previous configuration until the point just before the reduction to be made is reached. This allows the error-recovery method to determine if the error occurred at a point which had previously been reduced. To perform this parse, it is necessary for the parser to save the input program and previous stack configurations. Mickanus and Modry do not report on how much input is to be saved or what parser configurations are to be saved.

If the correction phase fails, Mickanus and Modry assume that it was spurious and have their parser delete the input token at the point the error was detected. It is possible that some other error-recovery method, such as panic mode, could be used at this point.

In order to illustrate this error-recovery method, consider the program segment

tmp1A-136_thumb[2]

There is a missing semicolon after the second statement. The stack configuration at the point where the error is detected is

tmp1A-137_thumb[2]

For simplicity, assume that there is only one initial state which can be used to start up the forward move that does not end because of a second error. Hence there is only one forward move to be considered initially.

After the forward move, the function CONDENSE would produce the stack configuration

tmp1A-138_thumb[2]

where the third and fourth statements of the program segment are reduced to (stmts). The token end is the current input token. The forward move terminates at this point because of a reduction over the error point due to the production

tmp1A-139_thumb[2]

The function REPAIR is then called to find a repair to the error. First, an insertion is attempted. An insertion of a semicolon succeeds because a descendent of the top state on the stack before the error point is one which would become on top of the stack if "var:= (integer) ;" is reduced, and this descendent state is an ancestor of the state associate^ with (stmts).

If no insertion of a terminal symbol is possible, a backup of the error point on the stack is tried. The resulting stack configuration becomes

tmp1A-140_thumb[2]

The set of right states associated with (integer) is also stored on the stack; these states are all the possible ones that can be associated with (integer) given the set of descendent states for (stmts). Then a repair is tried again. If no backup was possible, step 5 of the function would have deleted (integer).

Once all the possible repair candidates have been determined and no repair cost is below the maximum permissible value, repairs on the holding candidates are tried. These candidates would be passed to a recursive invocation of CONDENSE in step 6. If a repair is found, an attempt is made on the repaired forward move to correct the first error.

Mickanus and Modry report that finding a symbol for insertion can be simplified by adding states to the LR parser. Their method can easily be included within a parser generator. They do indicate, however, that the resulting table can be significantly larger.

This error-recovery method appears to be quite effective. The method requires that, in general, the parse be backed up. This can cause some semantic problems. It may be too difficult and time-consuming to alter the semantic actions on performing such a backup operation. A second problem with the method is that a repaired configuration may not be a rightmost cannonical sentential form (see Mickanus and Modry, 1978).

An error-recovery method by Pennetto and DeRemer Another error-recovery method which is based upon the Graham-Rhodes method is put forward by Pennello and DeRemer (1978). Their method attempts to take advantage of a property of LR grammars in order to reduce the number of reparses. Like the method of the previous subsection, their method also avoids the backward move in the condensation phase.

The forward-move algorithm is based upon the observation that many symbols which occur in a language are followed by the same set of strings so that the parser can continue parsing in the same fashion. For example, in many languages a semicolon is used as statement terminator and a statement usually follows a semicolon. As another example, consider the following productions from the PASCAL grammar (Wirth and Jensen, 1974):

tmp1A-141_thumb[2]

In each case, the symbol do is followed by (stmt). Hence, no matter in which of these statement types an error occurs just before the do, a statement is expected to follow the do.

The forward-move algorithm, which is to be given shortly, operates in a manner similar to that of Mickanus and Modry except that it stacks all the initial states as a set and continues parsing using all these states at the same time. Parsing continues until another error occurs, a reduction requiring left context is needed (i.e., over the error point), or more than one move is indicated. The important idea behind this approach is that once such a parse is done, the input which has been parsed does not need to be parsed again because the same sets of moves would be made in all cases. Consider the statements

tmp1A-142_thumb[2]

from the PASCAL grammar example given earlier. In either case if an error occurs before the token do, the statement "/’:= i + 1" will reduce to (stmt).

The main idea behind the forward-move algorithm is to continue parsing the input string just past the error point using a particular kind of parser called an error parser. The approach is to obtain all possible parses of the remainder of the input past the error point. These parses, however, must manipulate the stack in the same manner at each parse step. The forward-move algorithm behaves like an LR parser. The error parser pushes sets of states rather than simple states (as is the case in an ordinary LR parser) onto the parse stack. These sets of states enable us to keep track of parallel parses. At each step in the error parser, each state in the top state set of the stack is examined along with the next input symbol. If each state in the top state set indicates a shift (or an error), then the stack is pushed. In this case the set of all states obtained from making a transition from each state in the top state set on the next input symbol is pushed on the stack. If, however, the indicated move for each state in the top state set is a reduce (or an error) action and context to the left of the error point is not required, then a reduce operation is performed. In such a case state sets are popped from the stack and a new state set (depending on the reduction) is pushed on the stack. The process stops because of either another error, the lack of input tokens, more than one move being permitted, or more context to the left of the error point being required.

The following subalgorithm implements the forward-move strategy just described.

Function FORWARD_MOVE(l). This function performs the forward move to generate the configuration C as described earlier using the unread input string I. The input is assumed to be delimited by the end-of-file marker ‘#’. The variables r, t, and t’ are states of Q, and P is a set of states from the set of parser states Q. S is the stack of the parser. K is the set of states on top of a stack configuration Ur MOVES gives a set of parser moves given by the action table for some input symbol and set of states. The variable i is an index variable. The function REST deletes the first symbol from the string passed to it and returns the result. The function TOPV returns the topmost set of states on the top of the stack. The functions FIRST and SHIFT are the same as in previous algorithms. PUSH and POP are procedures for pushing and popping the stack, respectively.

tmp1A-143_thumb[2]

The first step of the algorithm initializes the stack to the set of parser states Q (denoted by ? in a trace which follows shortly). Step 2 next computes the state set, which may shift the first symbol in the remaining input. The third step contains a loop to process, potentially, all of the remaining input tokens. Step 4 obtains the next input token and the top state set on the stack. Step 5 determines the set of possible moves that the parser can make at this point. The next step exits from the loop if more than one move is possible. Step 7 performs, if possible, the indicated action. If the move is a shift operation, the set of destination states is pushed on the stack. On the other hand, a reduce move involves popping the appropriate number of state sets from the stack, provided the reduction does not involve the error point (?). The algorithm exits the loop if there are not enough state sets on the stack or if the indicated move is either an error or accept move. The last step of the algorithm returns the stack configuration.

As an example, consider the application of the previous algorithm to the grammar whose rules are the following:

tmp1A-144_thumb[2]

The parsing tables for this grammar are given in Table 7-46. Consider the erroneous input string i(i — /’)#. On detecting a missing operator, the parser halts with the stack containing state 0 at its bottom and state 2 on its top. Table 7-47 contains a trace of the forward-move algorithm on the remaining string (/ – /’)#. Observe that the parser halts with a top state set of {4,8,11}. In applying the last reduction in the table (i.e., T —* F), we pop the state set {3} from the stack. This causes the state set ? to become the new top state set. Based on the state set ? and the nonterminal T, there are four transitions possible:

tmp1A-145_thumb[2]

Consequently, we stack the state set {4,8,11}. Step 5 of the algorithm then computes the set MOVES as follows:

tmp1A-146_thumb[2]

Table 7-46 Parsing tables for sample grammar

State

Action function, F

Next state or goto function, G

i

-

tmp1A-147

(

)

#

S E

T

F

-

tmp1A-148

( )

0

S

S

1

4

3

2

5

1

tmp1A-149 tmp1A-150 tmp1A-151 tmp1A-152 tmp1A-153

6

2

tmp1A-154 tmp1A-155 tmp1A-156 tmp1A-157 tmp1A-158

3

tmp1A-159 tmp1A-160 tmp1A-161 tmp1A-162 tmp1A-163

10

4

tmp1A-164 tmp1A-165 tmp1A-166 tmp1A-167 tmp1A-168

5

S

tmp1A-169 tmp1A-170 tmp1A-171 tmp1A-172 tmp1A-173

7

4

3

2

5

6

s

tmp1A-174 tmp1A-175 tmp1A-176 tmp1A-177 tmp1A-178

8

3

2

5

7

tmp1A-179 tmp1A-180 tmp1A-181 tmp1A-182 tmp1A-183

6

9

8

tmp1A-184 tmp1A-185 tmp1A-186 tmp1A-187 tmp1A-188

9

tmp1A-189 tmp1A-190 tmp1A-191 tmp1A-192 tmp1A-193

10

s

tmp1A-194 tmp1A-195 tmp1A-196 tmp1A-197 tmp1A-198

11

3

2

5

11

tmp1A-199 tmp1A-200 tmp1A-201 tmp1A-202 tmp1A-203

Table 7-47 Trace of an error parser for the. input stringtmp1A-204

Step

Stack contents

Rest of input

Action taken

1

?

tmp1A-205 tmp1A-206

2

?{5}

tmp1A-207 tmp1A-208

3

?{5} {2}

tmp1A-209 tmp1A-210

4

?{5} {3}

tmp1A-211 tmp1A-212

5

?{5} {4}

tmp1A-213 tmp1A-214

6

?{5} {7}

tmp1A-215 tmp1A-216

7

?{5} {7} {6}

tmp1A-217 tmp1A-218

8

9 {5} {7} {6} {2}

tmp1A-219 tmp1A-220

9

9 {5} {?} {6} {3}

tmp1A-221 tmp1A-222

10

?{5} {7} {6} {8}

tmp1A-223 tmp1A-224

11

9 {5} {7}

tmp1A-225 tmp1A-226

12

?{5} {7} {9}

tmp1A-227 tmp1A-228

13

?{3}

tmp1A-229 tmp1A-230

14

? {4,8,11}

tmp1A-231 tmp1A-232

Since there are three possible moves (i.e., reductions), the algorithm stops. The reductiontmp1A-233_thumb[2]holds only iftmp1A-234_thumb[2]immediately precedes T on the stack.

tmp1A-235_thumb[2]is the proper reduction only if E — immediately precedes T. Similarly,tmp1A-236_thumb[2]applies only if nothing or a left parenthesis immediately precedes T. Since there is no left context on the stack, parsing cannot continue in a deterministic manner. This example, however, indicates three possible guesses at the reduction.

Recall that the parse stack in an LR parser represents, at any given point in the parsing process, a viable prefix of a sentential form. A string is a viable fragment of the remainder of an input sentencetmp1A-237_thumb[2]wheretmp1A-238_thumb[2]if and only if

tmp1A-245_thumb[2]

2. If a is a viable prefix of a sentential form

tmp1A-246_thumb[2]

In other words, during the parsing of any sentence whose suffix is xy, x must reduce to the valid fragment U. The forward-move algorithm produces a viable fragment. In the previous example (see Table 7-47)tmp1A-247_thumb[2] tmp1A-248_thumb[2]

The significance of a derived viable fragment to an error-recovery method is the following: Assume that the parser detects an error in the given input such that a represents the parse stack and the string xy is a suffix of the given input sentence. Tf the repair algorithm generates several a’ as possible replacements for a, we could reparse x for each replacement. Because of the notion of a derived viable fragment, we need reduce x to U only once. That is, x is reduced to U no matter what lies to the left of string xy. Pennello (1977) presents a theorem and proof of this assertion.

Once an error is detected, the forward-move algorithm can be invoked to consume all of the input; once another error is found, a reduction is required, or more than one move is possible, the forward move is performed again, resulting in another stack configuration. This leads to the set of stack configurations (derived viable fragments) Ux, U2,…, Un once the entire source program has been read. The following algorithm generates the desired set of derived valid fragments. The algorithm returns the resulting set of viable fragments once all of the input has been read.

Function T_FORWARD_MOVE(l). This function generates the set of configurations U,. U2,…, Un as described earlier using the unread input string I. C denotes the collection of viable fragments.

tmp1A-251_thumb[2]

The above approach generates state sets dynamically by referring to the parser states. We can, however, precompute these state sets and their associated transitions. The approach is similar to that taken in converting a nondeterministic machine to its deterministic equivalent (see Chap. 4). The details of this approach are not pursued further.

The correction phase attempts to find insertions, deletions, or substitutions which would repair the error. The algorithm which tries to find a repair follows. To simplify the error-repair process, the forward-move algorithm is applied to the input following the input character at which the error was detected. This method handles the case that this character might be erroneous much sooner than does the method of Mickanus and Modry. The method used by Pennello and DeRemer tries a deletion of the input token before considering other possible approaches to finding a repair. This is unlike the method of Mickanus and Modry, which deletes the input token only if no repair can be found by the error-recovery procedures.

Function ERROR_RECOVERY(C). Given the erroneous parser configuration C, this function attempts to find a repair for this configuration; if none can be found, it returns the stack configuration FAIL, which indicates that no repair could be found. The parser configuration C is also represented by (S, I), where S is the stack and I is the unexpended input string. C’ (which includes the stack S’ and input string I’) is a parser configuration. The variable h stores the first token of the input string which is not used in the foward move. FMC and FMC’ are sets storing the forward-move configurations returned by the function T_FORWARD _MOVE. The function TRY(S, T, C, CONFIG) attempts to "join" the stack configuration S to the set of configurations C using one of the terminal symbols in the set T. The resulting configuration is returned in CONFIG. The function also returns a logical value which indicates whether the stack and set of configurations can be joined. The function ATTACH returns a logical value indicating whether the terminal symbol passed to it as its first argument can be attached to the set of configurations also passed. The resulting set of configurations is returned through the second argument. The function ACCESSING_SYMBOL returns the input symbol which led to the top state on the stack. The functions FIRST and REST are the same as in previous algorithms.

tmp1A-252_thumb[2]

The function ERROR_RECOVERY first attempts to delete the input character. The function TRY, which is called tries to "join" the stack S’ with the set of forward-move configurations FMC which were returned by the forward-move algorithm using one in the set of characters passed to it as its second argument. The "join" can be performed in a manner similar to the way Mickanus and Modry did it and is not described in this subsection. TRY does attempt to join as many configurations together as possible. For example, once S’ and configuration U-, are joined, reductions might be possible on the result which then permit the result to be joined with U2. This is repeated until no more joins are possible.

If a deletion fails a substitution for the character h is attempted by passing the set of terminal symbols to TRY instead of an empty string. If a repair is found, the repaired configuration is returned to the parser.

The next repair attempt, which is performed in step 6, tries to find an insertion before the character h. The function ATTACH is used to join h with U1 and, possibly, the other configurations. TRY is then called to see if a terminal symbol can be inserted before h. If the insertion attempt succeeds, the repaired configuration is returned.

If none of the repair methods succeed, the algorithm tries to back up the stack. It cannot if h could not be attached to the forward-move configurations. Step 8 reports a failure and returns in this case. Step 7 performs the backup, attempting to undo reductions on the stack which should not have been done. The current approach used by Pennello and DeRemer is to back up the stack to the point where the previous input symbol was shifted onto the stack. The function ACCESSING_SYMBOL returns this symbol and resets the stack. Pennello and DeRemer (1978) state that this approach is only in the formative stage, and they are still considering other approaches.

In the program segment a syntax error occurs in the third statement where the assignment operator has been omitted.

tmp1A-253_thumb[2]

The forward-move algorithm begins by generating stack configurations with the semicolon following the variable "next". Depending upon the grammar for the language, stack configurations would likely be formed as follows: The semicolon cannot be reduced and is in the first configuration. It is the only symbol in the first configuration, as there may be more than one possible move beginning with the variable "save_val". However, the for loop and the statement "save(ctr):= 0" which follows are reduced to "(stmt)(stmt)". The token end cannot follow the second statement without the context, which indicates that there is a preceding do. Hence the three resulting stack configurations would be ";", "(stmt)(stmt)", and end.

In order to repair the error, the error-recovery function first tries to delete the variable "next" and to join together the stack with the stack configuration Uv As this attempted repair will not work, a replacement for the terminal symbol for "next" is then tried. First of all, the terminal symbol for "next" is joined with the stack configuration LI-,, resulting in the configuration "(var) ;". Then an insertion of a terminal symbol which permits the stack and the configuration U, to be joined is tried. In this case, inserting the assignment operator succeeds. However, if no insertion is found, the stack is backed up and another try at repairing the error is done.

When an insertion was being tried in the preceding example, if the terminal symbol for "next" could not be joined with the first stack configuration, the error-recovery procedure would have failed.

Like the previous method described earlier, if the present method fails to find a repair, another method is required to repair the error. However, Pennello and DeRemer state that which approach to use is not clear and the algorithm given merely returns, reporting a failure. A method such as panic mode might then be used.

In order to implement their method, Pennello and DeRemer indicate that the parser tables can be expanded to include states which represent the sets of states used for error repair. These states can be computed mechanically and do not generate an overly large number of states. They also give further algorithms for their implementation method (Pennello and DeRemer, 1978).

Pennello and DeRemer report that their error-recovery method has about 70 percent of the error diagnoses rated as either good or excellent, 12 percent rated poor, and 18 percent unrepaired. Their measures include spurious errors which are not considered by other authors. Such errors are usually not repaired. Pennello and DeRemer report that their error-recovery method requires an increase in the number of parser states by 20 to 30 percent.

Other LR error-recovery methods. Other methods of error recovery for LR parsers have been designed. This section briefly discusses two approaches. These methods are not as rigorous as the previously described methods in attempting to repair an error.

One approach, discussed by Graham, Haley, and Joy (1979) tries to provide good error diagnostics without attempting to provide the best, if any, repair. They use a two-phase approach. The first phase tries to find a repair by inserting a terminal symbol or by deleting or substituting the input token at the point where the error occurred. Error recovery is simple and does not try to perform or undo any necessary reductions on the stack like the method of Mickanus and Modry.

If no repair can be found, error recovery switches to a second phase which operates in a manner similar.to the way error recovery in YACC is done. For this method, the grammar is augmented with error productions. The stack is popped and input is ignored in order to resume parsing.

Druseikis and Ripley (1976) describe an approach to error recovery for an SLR( 1) parser. Their method performs simple, fast recovery without any repair. A forward move is made, but there is no attempt to integrate the information provided by the forward move into the correct prefix. Pennello and DeRemer’s method described in the previous subsection is a generalization of Druseikis and Ripley’s method. The former method works for all types of LR{\) parsers.

Next post:

Previous post: