Bottom-Up Parsing (Compiler Writing) Part 6

Error Recovery for Simple Precedence Parsers

The simple precedence parser described in the preceding sections performs a correct parse of any input string which is in the language generated by a particular simple precedence grammar. Furthermore, it refuses to accept any string which is not in the language. Given such a string, the parser stops at the first syntax error and reports that the string is invalid. As mentioned in Sec. 5-1, this behavior is unacceptable for a practical compiler. In a compiler, parsing must continue after a syntax error, in order to detect more syntax errors.

The simple precedence parsers described so far can detect errors in two ways. The choice of a parse action is based on the precedence relation between the symbol at the top of the parse stack and the input symbol. This precedence relation may be <, =, >, or ? (error). The ? relation indicates that this pair of symbols (the stack top and input symbols) cannot appear adjacent to one another. Thus, whenever the relation between the stack top and input symbols is ?, a syntax error is detected (a so-called character-pair error). Whenever the relation between the stack top and input symbols is >, a reduction action is indicated and the parser must match the right-hand side of some production in the grammar to the string of stack symbols being reduced. If no matching right-hand side can be found, then a reduction error is reported.

In this section we develop a simple precedence parser which has improved error-detection capabilities. However, because the simple precedence parser does not have the valid prefix property discussed in Sec. 5-2.3, a syntax error may sometimes go undetected until arbitrarily many symbols have been read beyond the first erroneous symbol. After developing the improved simple precedence parser, we discuss an algorithm which can recover from a syntax error and restart the parser so that the parser can detect further errors. This recovery algorithm was developed by Rhodes (1973) and Graham and Rhodes (1975) (see Sec. 5-4.2). Other simple precedence error-recovery schemes are presented by Wirth (1968) and Leinius (1970). However, the Graham-Rhodes method seems to be superior to these other simple precedence recovery methods.


The improved simple precedence parser uses the character-pair error check and the reduction error check just described. It also contains two additional means of error detection, a valid prefix check and a left stackability check. The valid prefix check (Rhodes, 1973) consists of checking that the partial handle on top of the parse stack is a prefix of the right-hand side of some production in the grammar. This check is performed whenever a symbol is put onto the stack by a reduction or a shift. The partial handle tested is the string of stack symbols extending from the < relation nearest the stack top up to the symbol on top of the stack.

The reader might assume that this additional check would take a large amount of time, since the table of production right-hand sides must be searched as each symbol is pushed onto the stack. In fact, the valid prefix check can be combined with the reduction error check and both types of error check can be performed with approximately the same amount of searching as is done by the unmodified simple precedence parser of preceding sections. To do this, the production table is sorted lexically by right-hand sides, and pointers into this table indicate which right-hand side’s have prefixes which match the symbols of the partial handle already on the stack. Before a new symbol is stacked, these right-hand sides can be checked to see whether any of them allow the new symbol following the symbols already seen. Notice that each time a < relation occurs, a new partial handle is begun on the stack. Therefore, a secondary stack must be used to save the pointers into the right-hand side table for the previous partial handle. This treatment of the valid prefix and reduction error checks has not been incorporated into Algorithm SP_PARSE (which occurs later in this section) in order to preserve clarity as much as possible. But it should be clear that the method consists of performing an incremental search for matching right-hand sides, rather than searching for a right-hand side whenever a reduction is indicated. The total amount of searching performed is only slightly more than in the unmodified parser.

The left stack ability check was originally introduced by Leinius (1970). When the handle is reduced, it is replaced on the stack by the left-hand side of a production whose right-hand side matches the handle. If there is a production U -» /? and the stack contains aAji before the reduction, then after the reduction the stack will contain aA U. Clearly, if further parsing is to be possible, one of A < U, A = U, or A > U must be true. If A ? U, a syntax error will eventually be detected. This error can be detected earlier by explicitly checking for the error condition A ? U immediately after each reduction.

Another improvement suggested by Leinius has nothing to do with error detection but does significantly speed up the parser. When the tail of the handle is found in the original simple precedence parser, the parser must search down through the stack for a symbol S, such that S,_1 < Sr The symbol S, is the head of the handle. Leinius points out that a pointer to such a symbol S, can be placed on an auxiliary head stack when the symbol is shifted onto the parse stack. Then when the tail of the handle is found and a reduction is necessary, the location of the head of the handle is in the top element of the head stack.

In order to neatly join the simple precedence parser and the Graham-Rhodes error-recovery algorithm, we make the parser into a subalgorithm (procedure) SP_PARSE which is called by a supervisory algorithm SP_SUPERVISOR. Procedure SP_PARSE incorporates the modifications which we have been discussing and sets an accept flag to true if the input string is parsed without errors. Algorithm SP_SUPERVISOR does the necessary initializations for Procedure SP_PARSE and calls the recovery procedure SP_RECOVER when necessary.

Algorithm SP_SUPERVISOR. This is the supervisory algorithm for the simple precedence parser and recovery system. It contains and initializes the global data structures used throughout the system. These data structures are: S, the parse stack, whose top entry is indicated by TOP; H, the head stack, whose top entry is indicated by HEAD, the input string x1x2…xn#, with index k and current input symbol INPUT, and a logical variable, ACCEPT. Algorithm SP_SUPERVISOR calls Procedures SP_PARSE and SP_RECOVER.

tmp245-824_thumb

Procedure SP_PARSE. This is a modified and improved simple precedence parser. It is called by Algorithm SP_SUPERVISOR and uses the global data structures defined in that algorithm. A local variable i is also used. Procedure SP_PARSE calls Procedure SP_ERROR to print syntax error messages. Note that the two special symbols ‘#’ and have the following relations with respect to the other vocabulary symbols:

tmp245-825_thumb

1. [Determine parse action]

Repeat through step 5 while true

tmp245-826_thumb

 

 

 

tmp245-827

Procedure SP_ERROR(i). This procedure prints a syntax error message for Procedure SP_PARSE. In an actual compiler the error messages generated here should be accompanied by information pinpointing the exact location of the syntax error in the source program. SP_ERROR has one local variable m, and one parameter i, which is the position in the stack of the first unparsible part of the phrase.

tmp245-828_thumb

We now present the Graham-Rhodes recovery scheme in the form of Procedure SP_RECOVER, which is called by Algorithm SP_SUPERVISOR. Procedure SP_RECOVER has the task of modifying the parse stack and input stream so that a detected syntax error may be safely bypassed and parsing may continue. When Procedure SP_RECOVER is called, the point at which the syntax error was detected is between the top symbol on the stack and the input symbol in INPUT. We will call this error position The Graham-Rhodes algorithm recovers well from syntax errors because it does not immediately attempt to fix up the stack and input based on a small fixed number of context symbols around Since an error can be detected by the parser an arbitrary number of symbols after the actual programmer error, no fixed amount of context is ever enough for all error cases. In order to get the effect of an unbounded amount of context in the error recovery, Graham and Rhodes have introduced a condensation phase, which precedes the correction phase of the error recovery.

The condensation phase performs reductions on the stack (the backward move) and parses forward in the input (the forward move) in order to condense contextual information about the error into as few symbols as possible. The condensation phase consists of two backward moves and a forward move.

The first backward move assumes that the ?! represents a > relation between the stack top symbol and the input symbol in INPUT. Thus the stack top symbol is the tail of a phrase, and a reduction may be performed if the following three conditions are met:

1. The string of stack symbols to be reduced matches the right-hand side of some production in the grammar.

2. When the symbols at the top of the stack are replaced by the left-hand side of the production, there must be a valid precedence relation (not ?) between the new top symbol of the stack (the left-hand side of the production) and the symbol below it on the stack.

3. After the reduction, the symbols on top^of the stack must still form a valid prefix of the right-hand side of some production of the grammar.

After the first backward move has condensed the information on the top of the parse stack, the forward move attempts to gather contextual information by parsing forward in the input. This is done by pushing the special marker onto the stack, pushing the input symbol INPUT onto the stack, and restarting the parser SP_PARSE. The parser will eventually detect another syntax error and return. We call the point of this second error ?2.

After the parser returns from its forward move, the second backward move is performed on top of the stack in the same way as the first backward move. The second backward move attempts to condense the information gathered by the forward move.

Once the condensation phase has compressed the contextual information about the syntax error into as few symbols as possible, the correction phase has the task of changing the string of symbols around the error into a phrase that appears to be correct. Rhodes (1973) presents several options which change the complexity and performance of the correction phase. The correction phase described here embodies options which provide simplicity and efficiency without sacrificing good recovery.

The correction phase consists of two parts. The first part computes three recovery sets, one set for each of three correction candidates. The three correction candidates are the strings of stack symbols between and ?2, between the latest < relation preceding and itself, and between the latest < relation preceding 11 and ?2. The recovery sets are sets of locally syntactically correct replacement strings for each of the correction candidates. A replacement string is locally syntactically correct if its left context symbol has a valid precedence relation to its leftmost symbol and its rightmost symbol has a valid precedence relation to its right context symbol.

The second part of the correction phase computes the cost of changing each of the correction candidates into each of its possible replacement strings. This cost computation is based on the costs of inserting, deleting, and replacing each terminal symbol of the grammar. The compiler writer supplies four vectors I, D, R, and T. I(x) is the cost of inserting the terminal symbol x, D(x) is the cost of deleting x, and R(x) is the cost of replacing x with the terminal symbol T(x). The R and T vectors are supplied because certain common errors involve accidental substitutions of symbols, for example ‘ = ‘ instead of ‘:= ‘. After the costs have been computed, the correction candidate and replacement string which together give the lowest cost estimate are used to make the error recovery. The correction candidate is replaced by the replacement string, the extra bottom marker is removed from the stack, and parsing may continue.

Procedure SP_RECOVER. This is a simple precedence error-recovery procedure called by Algorithm SP_SUPERVISOR. This procedure uses the global data structures defined in that algorithm. Three correction candidates, CCj, CC2, and CC3, and three recovery sets, R1; R2, and R3, are used, as well as local variables PREV_LT, TEMP, j, c, and Also, the variables MIN and JMIN are used in determining the least-cost replacement, PHIMIN. Procedure SP_RECOVER calls Procedure SP_BACKWARD to perform a backward move, and Function SP_COST to compute the cost of a replacement.

tmp245-829

 

 

 

 

tmp245-830

Procedure RESET(CC, RS). Given the two parameters CC and RS which hold the correction candidate and the recovery string, respectively, this procedure resets global variable TOP as well as the head stack index, HEAD. TOP will be incremented or decremented correctly depending on whether or not the recovery string adds symbols to the stack or deletes them. For each that is encountered in the correction candidate, HEAD will be decremented by 1. The variables c and r are local and hold the current symbol being looked at from the first and second parameters, respectively.

tmp245-831_thumb

As an example of this error-detection and -recovery technique, consider the following simple grammar:

tmp245-832_thumb

 

tmp245-833_thumb

This sample grammar, which is taken from the programming language Ada, gives the syntax for a simplified version of a record definition. It is not intended to be a complete definition of a record description in Ada, since only simple integer objects are allowed. It should be noted that the terminal symbol ‘expr’ denotes any (possibly parenthesized) integer arithmetic expression involving numbers and/or arithmetic operators. Integer variables are denoted by the terminal ‘id’. The precedence matrix for the example grammar is given in Table 7-25.

The following example is used to demonstrate the simple precedence parsing and error-recovery scheme as shown by Algorithm SP_SUPERVISOR and its associated subalgorithms

tmp245-834_thumb

The variables ‘num’ and ‘MAX’ are read as identifiers (id) while ‘(min + 1)’ represents an integer expression denoted by ‘expr’. In this example the keyword range is missing after the colon. Given the costs of certain insertions and deletions, the recovery method will detect the error and, furthermore, will be able to insert the keyword into its proper place in order to continue parsing to the end of the input string.

In connection with the following discussion, the reader is referred to the table showing the various stacks and variable values.

Control is passed back to the supervisor, and since ACCEPT is still false, the recovery procedure, SP_RECOVER, is invoked.

The first backward move is attempted and fails, as nothing can be reduced prior to the point of error.

A forward move is done by pushing an error delimiter (@) and the current input symbol on the stack and then calling the parser once more. Before calling Procedure SP_PARSE, we have the position shown in step 5. When processing a forward move, it is assumed that the new symbol on top of the stack is the head of some simple phrase.

Table 7-25 Simple precedence parsing table in record-description example

(rec def) (comp list) (obj decls)

(delim) (cons)

record end null range

>

: id

expr

(rec def)

tmp245-835 tmp245-836 tmp245-837 tmp245-838 tmp245-839 tmp245-840

(comp list)

tmp245-841 tmp245-842 tmp245-843 tmp245-844 tmp245-845 tmp245-846

(obj decls)

tmp245-847 tmp245-848 tmp245-849 tmp245-850 tmp245-851 tmp245-852

(delim)

tmp245-853 tmp245-854 tmp245-855 tmp245-856 tmp245-857 tmp245-858

(cons)

tmp245-859 tmp245-860 tmp245-861 tmp245-862 tmp245-863 tmp245-864

record

tmp245-865 tmp245-866 tmp245-867 tmp245-868 tmp245-869 tmp245-870

end

tmp245-871 tmp245-872 tmp245-873 tmp245-874 tmp245-875 tmp245-876

null

tmp245-877 tmp245-878 tmp245-879 tmp245-880 tmp245-881 tmp245-882

range

tmp245-883 tmp245-884 tmp245-885 tmp245-886 tmp245-887 tmp245-888

<

tmp245-889 tmp245-890 tmp245-891 tmp245-892 tmp245-893 tmp245-894
tmp245-895 tmp245-896 tmp245-897 tmp245-898 tmp245-899 tmp245-900

id

tmp245-901 tmp245-902 tmp245-903 tmp245-904 tmp245-905 tmp245-906

expr

tmp245-907 tmp245-908 tmp245-909 tmp245-910 tmp245-911 tmp245-912

All blank entries denote an error (?) relation.

Table 7-26 Simple trace in record-description example illustrating error recovery

Step

TOP

k

HEAD

INPUT

tmp245-913 tmp245-914 tmp245-915

H

S

PR

1

1

1

0

tmp245-916 tmp245-917 tmp245-918

2

2

2

1

tmp245-919

2

tmp245-920 tmp245-921

3

3

3

2

tmp245-922

3

23

tmp245-923 tmp245-924

4

4

4

2

tmp245-925

3

23

tmp245-926 tmp245-927

5

6

5

3

tmp245-928

3

5

236

tmp245-929 tmp245-930

6

6

5

3

tmp245-931

6

5

236

tmp245-932 tmp245-933

7

7

6

3

tmp245-934

6

5

236

tmp245-935 tmp245-936

8

8

7

4

tmp245-937

6

5

2368

tmp245-938 tmp245-939

9

8

7

3

tmp245-940

6

5

236

tmp245-941 tmp245-942

10

6

7

3

tmp245-943

6

5

236

tmp245-944 tmp245-945

11

5

6

3

tmp245-946

6

5

5

236

tmp245-947 tmp245-948

12

6

7

3

tmp245-949

6

236

tmp245-950 tmp245-951

13

8

8

4

tmp245-952

3

7

8

2368

tmp245-953 tmp245-954

14

8

8

4

tmp245-955

8

7

8

2368

tmp245-956 tmp245-957

15

7

8

2

tmp245-958

3

23

tmp245-959 tmp245-960

16

3

8

2

tmp245-961

3

23

tmp245-962 tmp245-963

17

3

8

1

tmp245-964

2

2

tmp245-965 tmp245-966

18

4

9

1

tmp245-967

2

2

tmp245-968 tmp245-969

19

5

10

1

tmp245-970

2

2

tmp245-971 tmp245-972

20

6

11

1

tmp245-973

2

2

tmp245-974 tmp245-975

21

2

11

1

tmp245-976

2

2

tmp245-977 tmp245-978

In this case, the assumption is true; so the parser can move ahead in the input string with no errors until step 10. At this step, (cons) .. (cons) has just been reduced to (delim). Now a check is done to make sure that all symbols from the previous phrase head to the new top symbol form a valid prefix. The check fails because of the previous error, and the current position is now as in step 11.

Returning to the recovery procedure, the second backward move cannot be done; so the three correction candidates with their corresponding a’s and y’s are chosen as

tmp245-979_thumb

The Rj’s are all equal to the empty set, and thus HEAD is decremented by 1 and control is passed back to the supervisor. The error delimiter is left on the stack and the parser is called again as if no previous interruption had occurred. In this case, upon comparing @ and (delim), a « relation is found and all variables are updated as in the normal case. Following this we again run into an invalid prefix problem. The nonterminal (delim) is actually = to ‘;1, but because of the previous error (delim) is assumed to be the head of a simple phrase and consequently (delim) followed by ‘;1 is not a valid prefix of any right-hand side.

No more can be done in the parsing algorithm at this stage; so we return to the supervisor, which invokes the recovery procedure just as if it was the first error being encountered. The backward-move attempt fails once more, and the stacks now contain the values as in step 13. Neither a forward move nor a second backward move is possible at this time, and three new correction candidates are chosen:

tmp245-980_thumb

This time,

tmp245-981_thumb

Presumably the first listed choice from R3 will be the least-cost replacement, PHIMIN, and thus CC3 will be replaced on the parse stack by

tmp245-982_thumb

RESET is now invoked with step 14 values. For each of the two error symbols (@’s) found in CC3, HEAD is decremented by 1. The first one is replaced by the missing keyword range, which causes no change to TOP, but the second is replaced by the empty string, and therefore TOP is decremented by 1 before going back to the parser for the last time with the position as in step 15.

There are no more errors, and the corrections made are indeed the right ones; so the parse will terminate successfully.

Rhodes (1973) states that "Some sort of fail-safe device must be incorporated into the system in order to guarantee that the recovery system terminates, i.e., that it does not get into an infinite loop of trying various corrections to the same error." In his implementation of the system, Rhodes specifies that if no new input symbols have been read after two attempts at recovering from an error, the recovery has failed. In this case Rhodes resorts to panic mode (Sec. 5-4.3) to recover from the error.

Graham and Rhodes implemented their error-recovery method in a parser for a subset of ALGOL and in a PASCAL parser. Their error-recovery results were compared with those obtained in the PL/C and Zurich PASCHAL compilers.

While these comparisons are somewhat subjective, their method compared quite favorably, with the diagnostics of the latter tending to be more intelligible and fewer errors being left undetected.

In closing we point out that the Graham-Rhodes recovery method can easily be modified to work with any precedence parsing method. In particular, the Graham-Rhodes technique is appropriate for recovery in the extended precedence and mixed-strategy precedence parsers described in the following section.

Next post:

Previous post: