Top-Down Parsing (Compiler Writing) Part 4

LL(1) Grammars with e-Rules

In this subsection we generalize the LL(1) grammars introduced in the previous subsection to include e-rules (see Sec. 2-4.3). We will see that such an inclusion defines a larger class of grammars and that the corresponding parsing algorithm is substantially more complex.

Before we proceed with the discussion, however, we must generalize the FIRST sets defined in the previous subsection so that they may contain the empty string. Therefore, the FIRST set of a string a is defined as

tmp9-280_thumb[2][2]_thumb

Note that this definition allows strings of length 0 and 1.

As an example to be used throughout this section, we use the grammar whose rules are:

tmp9-281_thumb[2][2][2]


In constructing this simple grammar, we are careful to-avoid the introduction of any left recursion. It is easy to do this because the grammar is very simple. Any discussion of top-down parsing, however, is incomplete if no mention is made of left recursion—the bane of any top-down parse technique—and its elimination. We can check a grammar for having nonterminals that are left-recursive by testing for the following:

Iftmp9-282_thumb[2][2][2]Aa, for some a, then A is left-recursive.

A is left-recursive if and only if A can be expanded into a string that begins with A. A rule is said to be direct left-recursive if it is the form A —► Aji. The relation F+ introduced at the end of the previous subsection when modified to incorporate empty rules is a convenient method for determining left-recursive nonterminals. This will be done later in this subsection.

We now turn to an informal discussion of a parsing algorithm for handling LL(1) grammars in which e-rules are permitted. Using the sample grammar introduced at the beginning of this subsection and the top-down parsing approach based on a single lookahead symbol, introduced earlier, we attempt to construct the parse for the input i[e] *- e#. The leftmost derivation for this string is

tmp9-284_thumb[2][2][2]

and the associated syntax tree is given in Fig. 6-12.

As before, the stack is initialized to contain the right-hand side of production 0:

A#

From this initial configuration, a rule must be selected on the basis of the stack symbol A and the current input symbol i. In this instance such a deterministic choice can clearly be made by choosing rule f. This results in a stack contents of

tmp9-285_thumb[2][2][2]

Since the stack top symbol and the current input symbol match at this stage, the stack symbol is discarded and the input cursor is advanced. As a result, the stack contents are changed to tmp9-288_thumb[2][2][2] and the new current input is [.

Parse tree for the string

Figure 6-12 Parse tree for the string tmp9-291_thumb[2][2][2]

Because the stack and the new input symbols match, the stack symbol is discarded and the next input character ] is obtained, thus giving a stack contents of

tmp9-292_thumb[2][2][2]

For the second time in this parse we encounter a nullable nonterminal. Note that the input symbol ] does not belong to the FIRST sets associated with productions 6 and 7. It seems that we can go no further with our present parsing process. This is indeed the case! We next develop, however, an informal modification and extension to the current parsing algorithm which will permit the parsing of the given string.

At the present position in the parse, it is obvious that rule 6 must not be selected. If we are to continue, rule 7 must be chosen. In order to avoid backup, we must be sure that this is the correct choice. Since for the first time we are faced with the application of an e-rule, we must develop a strategy for this case which avoids backup. The nonterminal C will be deleted as a direct result of an e-rule selection. Once this is done, we must have a match between the current input symbol and new stack symbol. Note that this is indeed the case if C is popped from the stack. In short, the symbol ] in our sample grammar can immediately follow a nullable nonterminal. It is this follow relation that we must now describe.

For a nullable nonterminal A, we define the following sets:

tmp9-293_thumb[2][2][2]

and S’ is the start symbol of the grammar) In the present case

tmp9-294_thumb[2][2][2]

Also, note that the FIRST sets associated with rules 6 and 7 and the FOLLOW set for nonterminal C are mutually disjoint. Consequently, no confusion can arise in choosing the next production to apply.

After the application of the ruletmp9-295_thumb[2][2][2]and the popping of ] from the stack, the current stack contents are

tmp9-297_thumb[2][2][2]

Since the current input symbol is <— , it is clear that rule 2 is not the required production. This leaves rule 3 as the only remaining alternative. It was mentioned earlier that B is a nullable nonterminal. The FOLLOW set for this symbol is

tmp9-298_thumb[2][2][2]

and this set and the FIRST sets for rules 2 and 3 are mutually disjoint. Consequently, by selecting rule 3 we are certain that no backup will be required later. A trace of the entire parse is given in Table 6-7.

The previous notions are formalized in the following definition.

Definition 6-3 A grammar G is LL(1) if and only if for each pair of rules

tmp9-299_thumb[2][2][2]

for some nonterminal A,

tmp9-300_thumb[2][2][2]

An equivalent but simpler formulation of the LL(1) condition is as follows:

For all rules

tmp9-301_thumb[2][2][2]

For all rules

tmp9-302_thumb[2][2][2]

We shall leave it as an exercise to show that the second simpler formulation is equivalent to the first.

Table 6-7 Trace of a parse for the stringtmp9-303_thumb[2][2][2]

Trace of a parse for the string

We define first a parsing function which, when given the symbol on top of the stack and the next input symbol, returns as a value either the production to apply or else an indication of how to continue or to terminate. This function is given by

tmp9-306_thumb[2][2][2]

where # marks the bottom of the stack and the end of the input string, and where (/?, /) is an ordered pair such that /? is the right-hand side of production number /. If A is the symbol on top of the stack and a is the current input symbol, then M is defined as follows:

tmp9-307_thumb[2][2][2]

In order to obtain the parsing table for a given LL(1) grammar, we require the FIRST and FOLLOW sets. It is important to note that the presence of e-rules affects the computation of both FIRST and FOLLOW sets.

Let us investigate the effect of e-rules on the computation of the FIRST sets. In order to do so, it is convenient to redefine the leftmost relation F of the previous subsection so as to take into consideration the e-rules in the grammar. The nullable nonterminals (see Sec. 2-4.3) can be computed using Algorithm EMPTY of Chap. 2. As before, we denote the set of nullable nonterminals by VN .

Given a productiontmp9-308_thumb[2][2][2]we have XFY1, and iftmp9-309_thumb[2][2][2]we also have XFY2, and iftmp9-310_thumb[2][2][2]we also have XFY3, etc. If the sets VN and VT contain m and n elements, respectively, these relations can be represented bytmp9-311_thumb[2][2][2]bit matrices, as was discussed in Sec. 2-1.2.

Note that production 0 of the grammar which involves the end-marker symbol # is to be used in the computations.

We can use Warshall’s algorithm to form the transitive closure of F to obtain F+. The FIRST sets for each nonterminal of the grammar can be obtained directly from the F+ matrix.

These FIRST sets are given by

tmp9-316_thumb[2][2][2]tmp9-317_thumb[2][2][2]

As pointed out earlier, this computation can provide a test for left recursion. If fortmp9-318_thumb[2][2][2]holds (i.e., there is a 1 on the main diagonal of F+ for a nonterminal), then A is a left-recursive nonterminal. FIRST(a) fortmp9-319_thumb[2][2][2]and

tmp9-320_thumb[2][2][2]containstmp9-321_thumb[2][2][2]then also all terminal symbols of FIRST(x2), etc. If alltmp9-322_thumb[2][2][2]then FIRST(a) also contains e.

Table 6-8 F, F and F* matrices and FIRST sets for sample grammar

F, F and F* matrices and FIRST sets for sample grammar

F entries are marked by F. F+ entries are marked by F or F+. F* entries are marked by F or F+ or F*.

Contributors to a FIRST set are F+ entries in VN X VT and diagonal elements of VTX VT.

In order to compute the FOLLOW sets, we also require the reflexive transitive closure of F+, which we denote by F*. Assuming that there are no left-recursion problems, we proceed to obtain F* by ORing the bit identity matrix with F+. This is readily accomplished by setting a 1 on each position of the main diagonal of F+. The relations F, F+, F*, and FIRST for our sample grammar are shown in Table 6-8.

The computation of the FOLLOW sets is simplified by the introduction of the following two relations B and L: Given a productiontmp9-329_thumb[2][2][2] tmp9-330_thumb[2][2][2]we also lettmp9-331_thumb[2][2][2]etc.

Given a productiontmp9-332_thumb[2][2][2] tmp9-333_thumb[2][2][2]we also lettmp9-334_thumb[2][2][2]The

transitive and reflexive transitive closures of L can easily be obtained, and these are denoted by L+ and L*, respectively. Recall that the FOLLOW sets for each tmp9-335_thumb[2][2][2]were defined as FOLLOWtmp9-336_thumb[2][2][2]

The follow set for A can be equivalently represented by Fig. 6-13. From the diagram it is clear that AL*X, XBY, and YF*z, from which we obtain FOLLOWtmp9-337_thumb[2][2][2]

Table 6-9 contains the L, L+, and L* relations and Table 6-10 exhibits the B and (L*BF*) relations along with the FOLLOW sets for the sample grammar. The parsing table can now be generated by using the definition of the M function given earlier. The parsing function obtained in this manner for our sample grammar is given in Table 6-11. Note that forwe say that e is in FIRSTS).

Table 6-9 L, L+, and L* matrices for sample grammar

 L, L+, and L* matrices for sample grammar

l. entries are marked by L. L* entries are marked by L or L*. L* entries are marked by L or L+ or L*

Tree representation of a FOLLOW set computation in terms of relations.

Figure 6-13 Tree representation of a FOLLOW set computation in terms of relations.

Table 6-10 B, (L*BF*) matrices and FOLLOW sets for sample grammar

B, (L*BF*) matrices and FOLLOW sets for sample grammar

B entries are marked by B l*BF* entries are marked by B or n.

Contributors to a FOLLOW set are L*BF* entries in VN x VT.

Table 6-11 LL(1) parsing, matrix M for sample grammar

LL(1) parsing, matrix M for sample grammar

Blank entries are all error entries.

We now culminate the previous discussion with the presentation of an algorithm to parse strings according to the top-down approach based on one symbol lookahead. We assume that the productions of any grammar implemented according to this method have been pretested and all left recursion has been eliminated.

Algorithm LL(1)_PARSER. Given the parsing function M for a particular LL(1) grammar G whose VN and VT sets contain m and n elements, respectively, this algorithm parses strings to determine whether or not these strings belong to L(G). The input string is stored in the string variable T. The variable p is the index of T that points to the current input symbol. The current sentential form is represented by a string variable, SENT, where t contains the leftmost symbol in the sentential form. Note that SENT is like a stack whose leftmost symbol is the stack top. The variable u contains the current input symbol. For convenience, we assume each terminal and nonterminal symbol is a string of length 1. The list of production numbers for the parse is represented by the string variable PARSE. The starting symbol of the unaugmented grammar is the string variable START. That is, START contains the left-hand side of production number 1.

tmp9-353_thumb[2][2][2]

 

 

tmp9-354_thumb[2][2][2]

Traces of this algorithm for the sample grammar with the stringstmp9-355_thumb[2][2][2]and tmp9-356_thumb[2][2][2]are given in Table 6-12.

This algorithm is very efficient in terms of its operational and storage requirements. In fact, the performance of the LL(1) parser is a linear function of the length of the given input string. For a string of length k the cost is cvk + c2k, where q and c2 are constants associated with the number of operations and space or storage requirements, respectively.

From the discussion dealing with the construction of the parser, it is clear that an LL(1) grammar must be unambiguous. The proof of this point is left as an exercise.

Another property of a grammar which satisfies the LL(1) condition is that such a grammar cannot contain left-recursive nonterminals. Let us give an informal proof of this fact. If X0 is a left-recursive nonterminal, then there exists a sequence of production applicationstmp9-357_thumb[2][2][2]such that tmp9-358_thumb[2][2][2]Clearly, all of the X, symbols are left-recursive nonterminals. At least

Table 6-12 Traces of two parses using sample grammar

Traces of two parses using sample grammar

one of these, however, must have another production if terminal strings are to be produced (the grammar is assumed to be reduced). Let this particular nonterminal be X and let its productions be

tmp9-364_thumb[2][2][2]

Now let us further assume without loss of generality that c^ is the right part of X which leads to the left-recursion problem. Thus we have

tmp9-365_thumb[2][2][2]

Sincetmp9-366_thumb[2][2][2]it is obvious that the LL(1) condition is not satisfied because the FIRST and/or FOLLOW sets associated with a! and a2 are not disjoint.

The role of e-rules in LL(1) grammars is important. Recall from Sec. 2-4.3 that context-free grammars with or without e-rules are equivalent in the sense that both types of grammars generate the same set of languages. This, however, is not the case for LL(1) grammars. In particular, the grammar

tmp9-368_thumb[2][2][2]

is LL(1), but the language it generates is not an ^’-language. That is, no .f-grammar exists which can generate this language. Kurki-Suonio (1969) has shown that the grammar is LL(1) (as the reader can readily verify) but an equivalent grammar without e-rules is not LL(1).

tmp9-369_thumb[2][2][2]

In fact, the equivalent grammar is LL(2).

One important aspect of top-down parsing, which up to this point has been ignored, concerns error detection and recovery. From our previous discussion, it is obvious that many entries of a typical parsing table are error entries. The next subsection examines how these entries can be used in the detection of and recovery from errors in LL(1) parsers.

Next post:

Previous post: