Top-Down Parsing (Compiler Writing) Part 5

Error Handling for LL(1) Parsers

The LL(1) parser described in the previous section performs a correct parse of any valid input string. Given an erroneous string, however, it parses to the point of the first error, prints ‘UNSUCCESSFUL PARSE’, and stops. Although this is acceptable behavior for an abstract parser, it is unacceptable in a practical compiler (see Sec. 5-1). This section discusses modifications to the LL(1) parser which allow it to react more intelligently to syntactic errors.

LL(1) parsers detect syntactic errors by means of the error entries in their parsing tables. When the symbol A on top of the stack and the input symbol ak give a table entry M(A, ak) = error, the parser knows that ak cannot validly follow the already parsed portion of the input, axa1…ak_l, and a syntactic error is reported.

LL(1) parsers possess the valid prefix property discussed in Sec. 5-2.3. This means that if the parser does not detect an error in the first portion a^a2… ak_l of an input string aya2… an, there must be some string of symbols akak+1… am such that axa2… am is a valid string of the language. Less precisely, we may say that a parser with the valid prefix property detects a syntactic error at the earliest point possible in a single left-to-right scan of the input with no backup. The valid prefix property eliminates the need for inserting or deleting stack symbols in LL(1) error recovery and repair. At any point in the parse the parser has already recognized a valid prefix, that is, the front part of some valid input string. Thus, when input symbol ak causes an error, the parser can always modify the unparsed input symbols ak…an so that the parse may continue.


The following LL(1) grammar, which generates simple arithmetic expressions, will be used for all examples in this section.

tmp9-378_thumb[2][2][2][2]

The LL(1) parsing table for this grammar is shown in Table 6-13.

Table 6-14 gives a trace of the parse of the input string a)# using the parsing table from Table 6-13. A syntactic error is detected in step 6 of the parse. The error is detected because the entry M(#, ‘)’) in Table 6-13 is blank {error). Notice that the error is detected at the input symbol ‘)’, the first erroneous symbol encountered in the left-to-right parse of the input. This early detection is guaranteed by the valid prefix property mentioned earlier.

Notice that the transition from step 4 to step 5 in Table 6-14 is unnecessary, because the first erroneous symbol is already at the beginning of the unparsed part of the input string. Furthermore, the transition causes the deletion of A from the parse stack, reducing the amount of syntactic information available for error recovery.

Table 6-13 LL(1) parsing table for sample grammar

M

a

(

)

+

#

E

(TAJ)

(TA, 1)

A

(e.3)

( + 7/1,2)

(£,3)

T

(a,4)

(( E), 5)

a

pop

(

pop

)

pop

+

pop

#

accept

Blank table entries represent error actions.

The presence of A on the stack indicates that the parser is ready to accept a string of the form + T + T + T…, that is, zero or more occurrences of the symbols + T, where T represents any string derivable from the nonterminal T in the grammar. If A is prematurely deleted from the stack, this information is lost. We can easily prevent A from being popped from the stack in such a case. Observe that wasted parse actions like the one just described occur only when the stack symbol is a nullable nonterminal, and the erroneous input symbol causes the nonterminal to be expanded to e (the terminal is in the FOLLOW set of the nonterminal). If we implement the parse stack as a vector with two top pointers, we can maintain the nullable nonterminal on the stack, while pretending to pop it. If no error occurs, we let the parser continue, but if an error is detected, we remember that the nonterminal is really still on the stack. Algorithm IMPROVED_LL(l) is a version of the LL(1) parser of the previous section and uses this method to retain nullable nonterminals in the event of an error.

Algorithm IMPROVED_LL(1). Given the parsing table M for a particular LL(1) grammar G, this algorithm parses strings to determine whether or not these strings belong to L(G). The input string is stored in the string variable T. The variable p is the index in T of the current input symbol. The parsing stack is a vector S with associated stack top indices TOP and TRUE_TOP.

Table 6-14 Trace of parse of input string a)#

INPUT

STACK

PARSE

1

a)#

E#

2

a)#

TA#

1

3

a)#

aA#

14

4

)#

A#

14

5

)#

#

143

6

Syntactic error detected

The variable t is used to hold the top symbol of the stack, and the variable u holds the current input symbol. PARSE is the list of production numbers which make up the parse of the input string. The goal symbol of the unaugmented grammar (the left-hand side of production 1) is initially in the variable START.

tmp9-379_thumb[2][2][2][2]

Now that we have the means to detect syntactic errors and to retain as much contextual information on the stack as is possible, we must devise ways of reporting and recovering from syntactic errors. We present two approaches to error reporting and two approaches to error recovery. In each case the first approach is of the ad hoc type discussed in Chap. 5. The second approach is a more systematic method.

The ad hoc reporting and recovery methods consist of designing an error message and a recovery routine for each of the error entries in the parsing table of the LL(1) parser. Table 6-15 shows our sample parsing table with the 28 error entries numbered. The compiler writer must consider the possible input symbols which can cause each error and then formulate a plausible response for each.

Table 6-15 LL(1) parsing table with numbered error entries

M

a

(

)

+

#

E

(TA,l)

(7H.1)

1

2

3

A

4

5

(e,3)

( + W.2)

(e.3)

T

M.4)

((E), 5)

6

7

8

a

pop

9

10

11

12

(

13

pop

14

15

16

)

17

18

pop

19

20

+

21

22

23

pop

24

#

25

26

27

28

accept

Consider the error entry numbered 1 in Table 6-15. This parsing action is taken when there is a nonterminal E on top of the stack, and the input symbol is ‘)1. Looking at the grammar, we find that a nonterminal E may arise from either of the expansions:

tmp9-380_thumb[2][2][2][2]

Since a ‘)’ appears where an E is required, the author of the erroneous input string must have either:

1. Inserted a ‘)’ at the start of the input, or

2. Omitted the expression between a pair of matched parentheses ‘ ( )’ •

The error message must indicate these possibilities to the programmer. Such an error message might be as follows:

MISSING EXPRESSION BETWEEN PARENTHESES, OR ‘) • AT START OF PROGRAM.

To recover from the error, the following algorithm might be invoked:

Algorithm LL_ADHOCl. p is the index into the input string and TOP is an index associated with the top of the stack.

tmp9-381_thumb[2][2][2][2]

When the input is ‘)a + a# the recovery routine deletes the ‘)’. When the input is ‘a + ( ) + a#\ the recovery routine changes the stack and the input string so that the missing expression problem can be ignored.

This error-handling strategy is an instance of the ad hoc approach to error handling. Here we see that the method can recover effectively from errors. We also see that the compiler writer must anticipate all possible errors in order to supply good messages and effective recovery. If the compiler writer makes a mistake or fails to anticipate some syntactic error, the compiler may become unstable and an avalanche of errors may result. There certainly are ample possibilities for mistakes in this ad hoc approach, since a process comparable with the discussion in preceding paragraphs must be performed for each error entry in the parsing table. While there are only 28 such entries in our small example, a practical compiler may have thousands of error entries in its parsing table.

The preceding discussion of ad hoc methods suggests that systematic techniques may be far simpler than ad hoc techniques. Systematic techniques either avoid the large amount of detail found in the ad hoc approach or else use some automated algorithm to generate the details of the error handler. We now consider systematic approaches to error reporting and error recovery in an LL(1) parser.

Errors in the LL(1) parser consist of a mismatch between some expected construction and the symbol actually occurring in the input string. A possible systematic approach to generating error messages consists of associating a string with each symbol of the grammar and printing error messages of the form:

EXPECTING string associated with symbol on the stack,

BUT FOUND string associated with the input symbol.

The strings to be associated with each symbol of the grammar can be easily generated as follows:

Algorithm COMPUTE_TS. This algorithm computes a table TS of terminal strings to be associated with each symbol of the grammar. The strings associated with the terminal symbols are the terminal symbols themselves. For nonterminals, the FIRST and FOLLOW sets are used to compute a set F consisting of the terminal symbols which can validly appear where the nonterminal is required. The string associated with a nonterminal consists of the names of terminals in its F set concatenated together and separated by the word ‘ OR’.

tmp9-382_thumb[2][2][2][2]

Table 6-16 gives the TS table for our sample grammar. Compare the message generated by the systematic error reporter for error entry number 1 with the ad hoc message designed earlier. The systematic error message for input symbol 1)’ and stack symbol E is

tmp9-383_thumb[2][2][2][2]

The systematic approach to error recovery which we describe here is based on Fischer, Milton, and Quiring (1977), and Fischer and Milton (1977). Gries (1971), Holt and Barnard (1976), and Irons (1963) also present systematic approaches to top-down error recovery; none of these methods are directly applicable to the table-driven LL(1) parser described in Sec. 6-2.4.

The Fisher, Milton, and Quiring method uses error repair as a means of error recovery. Furthermore, it modifies only the unparsed portion of the input string in making its repair. Since nothing is done to the stack, it is never necessary to withdraw translation actions caused by input symbols already parsed. The error-repair method consists of Algorithm LL_REPAIR, which is called by the ERROR_HANDLER of Algorithm 1MPR0VED_LL(1). When a syntactic error is detected, the parse stack consists of the symbols S,S!_l… Sv where each S, is a terminal symbol, a nonterminal symbol, or the marker symbol #. The input string is al… ak… an+l where each a, is a terminal symbol, except for a„ + 1, which is the marker symbol #. The symbol ak has caused the error. In order to repair the syntactic error, Algorithm LL_REPAIR deletes symbols ak…aJ_1 so that the next input symbol is aj (j may be equal to k, i.e., there may be no deletion). Then it inserts symbols preceding a} so that aj will be accepted after the inserted symbols have been parsed. This can only be done if for some Sm in the stack,tmp9-384_thumb[2][2][2][2]that is, Sm generates some terminal string which contains ay. If this is true, the insertion task is relatively simple. For each stack symbol S, above Sm on the stack, we insert a terminal string which S, can generate. Then, iftmp9-385_thumb[2][2][2][2]the terminal string y (called the prefix of Sm and Oj) is inserted into the input string.

Table 6-16 TS table for sample grammar

Symbol x

TS(x)

a

tmp9-388

(

tmp9-389

)

tmp9-390

+

tmp9-391

#

tmp9-392

E

tmp9-393

A

tmp9-394

T

tmp9-395

If the parser is now restarted, it can match the contents of the input string with the contents of the stack to get a successful parse.

Algorithm LL_REPAIR first performs a deletion of symbols from the input string and then performs an insertion of symbols into the input string. In most repair situations, the algorithm must choose from a large number of possible deletion-insertion sequences. The algorithm makes its choice based on a system of weights or "costs." A cost is associated with the deletion of a symbol and with the insertion of a symbol. Algorithm LL_REPAIR selects its repair strategy so that the total cost of the operations it performs is minimized in a local context. Compiler writers choose values for the costs of inserting or deleting each terminal symbol of the grammar. They choose these costs based on their knowledge of the language being parsed. They assign a low insertion cost to symbols likely to be omitted by a programmer, and a low deletion cost to symbols likely to be accidentally inserted by a programmer.

Given that input symbolstmp9-396_thumb[2][2][2][2]have already been deleted from the input string, the cost of performing the insertion task can be easily calculated. If Sm is the highest symbol in the stack such thattmp9-397_thumb[2][2][2][2]then the inserted symbols will consist of the least-insertion-cost strings derivable from each oftmp9-398_thumb[2][2][2][2](called the LCS strings for these symbols) and the string y (the prefix of Sm and aj). The cost of the insertion is the sum of the insertion costs of each symbol in these strings. To ease the calculation of this total cost of insertion, Algorithm LL_REPAIR uses two cost tables. IC is a table containing the insertion costs of terminal symbols (as chosen by the compiler writer) and the insertion costs of nonterminals (the costs of the least-cost strings of terminals derivable from the nonterminals). PC is a table containing the insertion cost of the prefix of a nonterminal A and a terminal a. Thus the total cost of the insertion is

tmp9-402_thumb[2][2][2][2]

The actual string inserted is

tmp9-403_thumb[2][2][2][2]

It is useful to settmp9-404_thumb[2][2][2][2]These infinite costs indicate that a prefix does not exist for the symbol pair (A, a).

In order to choose its repair strategy, Algorithm LL^REPAIR computes the total deletion and insertion cost of each possible repair and selects the repair with the lowest cost. The deletion cost of each terminal symbol is in the DC table. Since the deletion cost is strictly increasing as more and more symbols are deleted, it often happens that the best total cost to date is less than or equal to the deletion cost at some point. In such a case, it is unnecessary to look further for a repair with a lower total cost, because the deletion cost will dominate all such repairs.

Table 6-17 Repair tables for sample grammar

IC

DC

LCS

a

4

a

4

£ a

(

3

(

2

A e

)

2

)

1

T a

+

1

+

3

E

4

A

0

T

4

#

5

£

0

Prefix table

PC table

a

(

)

+

a

(

)

+

#

E

£

£

(a

a

E

0

0

7

4

oo

A

+

+

+ (a

£

A

1

1

8

0

00

T

£

£

(a

(a

T

0

0

7

7

00

Table 6-17 gives the values for all the tables required for our sample grammar.

The complete error-repair algorithm is as follows:

Algorithm LL_REPAIR. This algorithm modifies the input string T of Algorithm IMPROVED_LL(1) so that the improved LL(1) parser may continue parsing. LL.REPAIR uses the IC, DC, LCS, PREFIX, and PC tables previously described. It uses the values of variables p and TRUE_TOP from Algorithm IMPROVED _LL(1) and examines the contents of the parse stack S. Local variables C, CBEST, CDEL, and PCT are used to compute costs. E is a variable containing the input symbol being examined. The variables q and i are integer indexes. qBEST is the value of q corresponding to the current optimum.

1. [Initialize costs and select first input symbol]

tmp9-406_thumb[2][2][2][2]

2. [Choose optimal number of deletions]

Repeat through step 5 while

tmp9-407_thumb[2][2][2][2]

Repeat through step 5 while

tmp9-408_thumb[2][2][2][2]

3. [Calculate cost of insertion at this point

tmp9-410_thumb[2][2][2][2]

4. [See if it is cheaper to insert here than at previous best point

tmp9-412_thumb[2][2][2][2]

6. [Deletion is now chosen. Perform insertions before pth symbol]

tmp9-413_thumb[2][2][2][2]tmp9-414_thumb[2][2][2][2]tmp9-415_thumb[2][2][2][2]

We now consider a simple example using our sample grammar. Given the input string a + + (a)#, our IMPROVED_LL(1) parser halts with TA# on its parse stack and p = 3 (the reader is encouraged to verify this). The stack symbol T and input symbol + lead to an error action. The ERRORJHANDLER generates an error message and calls LL_REPAIR. LL_REPAIR considers each of the following possible repairs on the basis of their combined insertion and deletion costs:

Deleting no symbols, inserting (a, gives ja + (a + {a)#, cost = 7 Deleting one symbol, inserting e, gives a + (a)#, cost = 3 Deleting two symbols, inserting e, gives a + a)#, cost = 5

Notice that the deletion cost CDEL = 5 is now greater than the minimum cost CBEST = 3. Since we are seeking to minimize the cost and since CDEL can never decrease, there is no point in considering further deletions. The least-cost repair is to delete the + and insert the empty string. The parser is restarted with TA# on the stack, input string T = a + (a)#, and p = 3. The reader should trace the repair algorithm in detail and observe how the cost minimization is performed.

Another example illustrates the choice of an insertion or deletion based on the insertion and deletion costs chosen by the compiler writer. Given the input string aa# the parser stops with A# on the stack, and p = 2. Algorithm LL_REPAIR considers the following alternatives:

Deleting no symbols, inserting + , gives a + a#, cost = 1

Deleting one symbol, inserting e, gives a#, cost = 4

Since the deletion cost CDEL is now 4 and the minimum cost is 1, no further deletions are considered. Because the cost of deleting or inserting a was set to a high value by the compiler writer, the repair algorithm avoids inserting or deleting a. In a more elaborate grammar the terminal a would be replaced by an identifier or number. Experience indicates that operands are less likely to be accidentally inserted or omitted than are operators.

Algorithms COMPUTE_COST and COMPUTE_PREFIX, which follow, are used to compute the repair tables necessary for the Fischer, Milton, and Quiring repair method. Algorithm COMPUTE_COST calculates the insertion costs of nonterminal symbols of the grammar based on the rule that the cost of a string of symbols is equal to the sum of the costs of the individual symbols in the string. This rule is applied to the right-hand side of each production of the grammar to get the cost of the nonterminal on the left-hand side of the production. When a nonterminal has several right-hand sides, the cost of the nonterminal is set to the cost of the minimum-cost right-hand side. Thus, the entry for a nonterminal in the IC table is the insertion cost of the minimum-cost terminal string derivable from the nonterminal. The LCS table, which contains the minimum-cost strings themselves, is initialized at the same time as the insertion costs are computed.

Initially, only the costs of the terminal symbols are known. Based on these costs, the algorithm computes the costs of nonterminals which have right-hand sides consisting only of terminals. Once the costs of these nonterminals are known, the costs of nonterminals with these nonterminals in their right-hand sides may be computed. This process continues in a "bottom-up" fashion until no changes occur in one complete pass through the grammar. The computation is then complete.

Algorithm COMPUTE_COST. Given the grammar and the insertion costs for the terminals of the language, this algorithm calculates the insertion costs for all nonterminals of a language. It also computes the LCS table values. A logical variable NOCHANGE is used to control the main loop of the algorithm.

tmp9-416_thumb[2][2][2][2]

The concept of a prefix was introduced in the discussion preceding Algorithm LL_REPAIR. If a nonterminal A is such thattmp9-417_thumb[2][2][2][2]and y g then y is called a prefix of the pair (A, a). The PREFIX table computed by Algorithm COMPUTE_PREFIX contains a prefix for each pair consisting of a nonterminal A and a terminal a. In some cases such a prefix may not exist (i.e., tmp9-418_thumb[2][2][2][2]In other cases there may be more than one prefix for a given pair.

In such a case, the prefix with the lowest insertion cost is chosen. The PC (prefix cost) table contains the insertion cost of each PREFIX entry. If there is no prefix for (A, a), thentmp9-419_thumb[2][2][2][2]In addition,tmp9-420_thumb[2][2][2][2]

Algorithm COMPUTE_PREFIX operates in a "bottom-up" fashion similar to Algorithm COMPUTE_COST. The first iteration through step 3 computes prefixes for pairs (A, a) wheretmp9-421_thumb[2][2][2][2]Once this pass is completed, prefixes can be computed for some pairs (B, a) wheretmp9-422_thumb[2][2][2][2] ya… . This is possible because prefixes have already been computed for the nonterminals A. The process continues, adding one level of derivation with each iteration, until one complete step occurs without changing the table. At this point, the final prefixes have been computed for all pairs of a nonterminal and a terminal. The prefix cost table PC is set to the insertion cost of each prefix as it is computed, and is used to choose the minimum-cost prefix where more than one prefix exists.

Algorithm COMPUTE_PREFIX. This algorithm computes the PREFIX and PC tables used by Algorithm LL_REPAIR. It requires the IC and LCS tables from Algorithm COMPUTE_COST. Temporary variables min, i, j, k, and C are used by the algorithm.

tmp9-429

In this section we have presented alternative means of detecting, reporting, and recovering from syntactic errors in table-driven LL(1) parsers. The reporting and recovery methods contain examples of the ad hoc and the systematic approaches to error handling. The papers by Fischer, Milton, and Quiring (1977), and Fischer and Milton (1.977) contain recommendations for improvements to the algorithms given in the last part of this section. Further variations of the method can be found in Fischer, Milton, and Quiring (1980).

Next post:

Previous post: