Bottom-Up Parsing (Compiler Writing) Part 11

LALR(l) Parsers

In this subsection we present a parsing technique known as LALR(\) parsing. This method of parsing is similar to but more powerful than the SLR{ 1) method because it uses lookahead information which can be obtained, for example, from the construction process given in the previous subsection. Although an LALR( 1) parser is not as powerful as an LR( 1) parser, the former offers the important advantage of requiring much less space for its parsing tables than the latter. The LALR( 1) parsing tables can be obtained directly from the sets of LR(l) items obtained previously.

Table 7-40 LALR( 1) item sets for an example grammar

State

LALR( 1) item

Next state or reduce

[0]

tmp245-1876

[11]

tmp245-1877

[1]


tmp245-1878

[3]

tmp245-1879

[2]

tmp245-1880

[3]

tmp245-1881

[2]

tmp245-1882

[1]

[11

tmp245-1883

[4]

tmp245-1884

[5]

[21

tmp245-1885

Reduce 3

tmp245-1886

[9]

[3]

tmp245-1887

Reduce 2

tmp245-1888

Reduce 5

HI

tmp245-1889

[6]

tmp245-1890

[2]

tmp245-1891

[7]

tmp245-1892

[2]

tmp245-1893

[6]

[5]

tmp245-1894

[8]

tmp245-1895

[7]

tmp245-1896

[81

[6]

tmp245-1897

Reduce 1

tmp245-1898

[5]

[7]

tmp245-1899

Reduce 5

[8]

tmp245-1900

Reduce 4

tmp245-1901

[9]

[9]

tmp245-1902

[10]

[10]

tmp245-1903

Reduce 6

[HI

tmp245-1904

Accept

Consider the LR(l) sets of items given in Table 7-37. The item sets associated with, say, states 8 and 9 are very similar. In fact, the first components (or cores) of these two items are identical. The only difference is in their second components, that is, the lookahead sets. The sets of LR( 1) items for states 8 and 9 can be merged into one set yielding the item

tmp245-1905

where the lookahead symbols of both item sets have been combined into one set of symbols. Similarly, states 5 and 13 can be merged into one state. For convenience, we present C5 and C13 again as follows:

tmp245-1906

Note that the cores of the corresponding items in the two sets are the same. Therefore, the two sets can be combined into the following set:

tmp245-1907

Continuing this process, each pair of sets of items for states 2 and 7,12 and 16,10 and 14, and 11 and 15, can also be merged into one state. The resulting finite-state control is the same (in their first component) as that obtained by the LR(0) construction process. The only difference is that the items derived from the LR(1) sets of items have a second lookahead component.

An LALR( 1) parser obtained by the previous merging process behaves in a manner very similar to that of its Li?(l) counterpart. In some instances, however, an LALR(l) parser may perform a reduce action instead of signaling an error that would be detected by the corresponding LR(l) parser. Such an error will eventually be detected by the LALR(\) parser before any further input symbols are shifted onto the parse stack. More will be said about this point shortly.

Using the example grammar of the previous subsection, we will now construct its LALR( 1) parsing tables from the LR( 1) sets of items. The item sets in the LALR( 1) machine are derived by the merging of the LR(l) sets of items as follows:

tmp245-1908

where the symbol [/] denotes state / in the LALRil) machine. The collections of items for the finite control of the LALR( 1) parser appear in Table 7-40.

Consider some of the next-state transition entries in this table. In particular, let us examine state [2], which was obtained by merging states 2 and 7 in the original LR(l) machine. In that machine there is a transition from state 0 to state 2 on the symbol T\ also, there is a transition from state 4 to state.7 on the symbol T. These two transitions in the LALR(1) machine will be denoted as follows:

tmp245-1909

Similarly, there are transitions from state 2 to state 11 and from state 7 to state 15 on the symbol * in the original machine. Since states 11 and 15 in the LR(1) item sets have been merged into state [9] in the LALR(\) machine, the two aforementioned transitions can be merged into one transition from state [2] to state [9] on the symbol * in the LALR(\) machine. As another example, consider the states 8 and 9 in the LR(l) machine which have been merged into the state [7] in the LALR(\) machine. In the original machine there are transitions on the symbol / from state 5 to state 9, state 4 to state 8, and state 13 to state 8. Since original states 5 and 13 have been merged into state [5], there are transitions from state [5] to state [7] and from state [4] to state [7] on the symbol / in the LALR(Y) machine. It is apparent from this discussion that the next-state transitions depend only on the core of an item set and, consequently, the next-state transitions of merged sets can also be merged. The remaining next-state entries of Table 7-40 can be obtained in a similar fashioii.

The parsing functions for the LALR( 1) parser are derived by using Table 7-40 as input to Algorithm LR(1)_CONSTRUCTOR of the previous subsection. The application of this algorithm to the current item table yields the parsing functions given in Table 7-41. Since these functions are not multiple-valued, the given grammar is LALR{1). It is instructive for the reader to compare Tables 7-38 and 7-41 at this point and note their similarities.

The previous approach to obtaining parsing functions for an LALR( 1) grammar is summarized in the following high-level algorithm.

1. Obtain the sets of LR(1) items

tmp245-1910

2. Merge all the sets of LR(1) items that have the same core. Let this new collection of items be denoted by

tmp245-1911

Table 7-41 Parsing functions for an LALR(l) grammar

State

Action function, F

Next state or goto function, G

/

=

+

e

S E

T

/

- +

*

[0]

S

[11] [1]

[2]

[3]

[1]

S

S

[4] [5]

[2]

Rl

Rl

Rl

[9]

[3]

R 5

R5

R5

R2

[4]

s

[6]

[2]

[7]

[5]

s

[8]

[7]

[6]

S

Rl

[5]

[7]

R5

R5

R5

R5

[8]

R4

R4

S

R4

[9]

19]

s

[10]

[10]

R6

R6

R6

Rb

[11]

A

3. The next-state transition table for the merged machine is obtained in the following manner: Let Mp be the new item set obtained from the merging of the original LR(1) item sets, say,

tmp245-1912

wheretmp245-1913Recall that each item set

tmp245-1914has the same core. The next-state transition entry from item set Mp to some other item set Mq on a symbol X can be obtained from the next-state transition entries of the LR(1) machine.

4. The item sets and next-state transition information of the merged machine is used as input to Algorithm LR(1)_CONSTRUCTOR to derive the parsing functions F and G. If these functions are single-valued, the grammar is LALR(1); otherwise it is not.

Given an LR(l) grammar with its associated collection of LR(1) item sets, let us examine under what circumstances the previously described merging process can yield multiple-valued entries in the parsing tables; that is, an LR(l) grammar is not LALR( 1). Parsing conflicts (or local ambiguities) can be of either the shift-reduce or the reduce-reduce varieties. A shift-reduce conflict depends solely on the core of an LR( 1) item and not on its lookahead part. Consequently, shift-reduce conflicts cannot arise in an LALR( 1) parser that is derived from the collection of LR( 1) items of an LR(l) grammar. Reduce-reduce conflicts, however, can arise in the state-merging process which is followed in attempting to obtain an LALR( 1) parser. An example of such a phenomenon is left to the exercises.

Table 7-42 Error detection in an (a) /./?(!) parser and (b) LALR{\) parser

(a) LR(l) parser

Step

Stack contents

Input string

Action

Output

0

0

tmp245-1917

1

03

tmp245-1918

S3

2

02

tmp245-1919

5

3

01

tmp245-1920

R 3

3

4

015

tmp245-1921

S5

5

0159

tmp245-1922

59

6

error

tmp245-1923

(b) LALR( 1) parser

0

[0]

tmp245-1924

1

[0] [3]

tmp245-1925

S[3]

2

[01 [2]

tmp245-1926

R 5

5

3

[0] [1]

tmp245-1927

R 3

3

4

[0] [1] [5]

tmp245-1928

S[ 5]

5

[0] [1] [5] [7]

tmp245-1929

S[7]

6

[0] [1] [5] [8]

tmp245-1930

*5

5

7

[0] [1]

tmp245-1931

R4

4

8

error

tmp245-1932

Earlier in this subsection, mention was made that an LALR(l) parser may not detect an input error as quickly as its corresponding LR( 1) parser. Table 7-42 contains two traces in an attempt to parse the invalid string / + /. The first trace given is that of of an LR( 1) parser (with the parsing tables of Table 7-38). Observe that two reductions occur before an error is detected in the input string. The second trace represents an attempt by the LALR(l) parser to parse the given input string. Note that four reductions are performed here. The last two reductions (T -* f and E -* E + T) were not generated by the LR(1) parser. In general, LALR( 1) parsers may perform more reductions than LR(1) parsers before detecting errors.

The approach used in this subsection for generating an LALR( 1) parser from the collection of LR( 1) items has one major drawback. It requires the initial storage of a large collection of items. Another approach is to start with the LR(0) collection of items and generate the necessary lookahead information required by the LALR(T) parser. The next subsection examines this approach.

Efficient Generation of Lookahead Sets for LALR( 1) Parsers

This subsection formulates an algorithm for generating lookahead information for LALR{\) parsers. The lookahead sets are to be derived directly from the LR(0) machine associated with a given grammar.

Recall from Sec. 7-4.3 that inadequate states in SLR( 1) grammars can be resolved with information succeeding each inadequate state, that is, right context which is derived from FOLLOW set computations. The resolution of inadequate states in LALR(l) parsers, however, requires in general information both preceding and following each inadequate state, that is, both left and right context.

There have been several efforts to generate the lookahead sets for an LR(l) parser from the LR(0) machine of its associated grammar. Some of these efforts, however, have not generated the lookahead information for the set of LALR(\) grammars. For example, Lalonde (1971) and Barrett and Couch (1979) give algorithms for the generation of lookahead information for the set of SLALR( 1). This set of grammars is a superset of the class of SLR(l) grammars but a proper subset of the LALR(l) grammars. Anderson et al. (1973) have also formulated an algorithm for generating lookahead sets.

Aho and Ullman (1977) give an algorithm for generating the required lookahead information for an LALR( 1) grammar from its LR(0) machine. Their algorithm is essentially the one used in the compiler-compiler system YACC (Johnson, 1974).

The method that we outline in this subsection is due to DeRemer and Pennello (1979, 1982). Their method appears to be the most efficient way of generating LALR(l) lookahead information.

Each inadequate state in the LR(0) machine associated with a given grammar requires lookahead information for its resolution. Recall that there is at least one shift-reduce or reduce-reduce conflict in an inadequate state. The lookahead set for an inadequate state q, where a reduction involving the application of a rule A -* a may apply, is defined as

tmp245-1936

where "8a accesses q" means that starting from the start state of the machine the scanning of the string 8a will result in a sequence of state transitions, the last of which is state q. The approach to computing the lookahead sets is to express them in terms of intermediate relations.

This subsection first introduces notation which facilitates the formulation of an efficient algorithm for computing lookahead sets. We then formulate several relations which give the underlying structure of the lookahead sets. Finally, an algorithm for computing lookahead sets is given. The details of the method are given in DeRemer and Pennello (1982).

A transition from state p to state q on symbol X, denoted by (p, X), is shown astmp245-1937wheretmp245-1938if q is not important.

Given an LR(0) machine, we define

tmp245-1941

where Cq is the set of items associated with state q andtmp245-1942. Recall state q in an LR(0) parser is inadequate if and only if there exists antmp245-1943such that G(q, a) is defined and Reducetmp245-1946(a shift-reduce conflict) or Reduce (q, a) contains more than one rule (a reduce-reduce conflict), or both. In an LALR( 1) parser the previous definition can be reformulated as

tmp245-1948

In order to compute the lookahead sets, it is useful to concentrate on nonterminal transitions and define their FOLLOW sets.

tmp245-1949

Intuitively, a FOLLOW set specifies the terminals that can immediately follow the nonterminal A in a sentential form tvhere prefix 8 accesses state p. It turns out that a lookahead set is the union of associated FOLLOW sets. It can be shown that

tmp245-1950

wheretmp245-1951denotes a sequence of single transitionstmp245-1952

tmp245-1953In words, the lookahead set for the ruletmp245-1954in state q is the set union of the FOLLOW sets for the transitions on A from some state p, which on "reading" a, will terminate in state q. Alternatively, when the ruletmp245-1955is applied in state q and |a| states are popped off the parse stack, a state p appears on top of the stack which reads A with a lookahead symbol of a. Figure 7-13 exhibits this situation. When in state q with the next input symbol a that belongs to any of the follow sets FOLLGW(/>(, A) fortmp245-1956the parser will apply rule A -» a. States px through pm serve to remember left context. The following relation is useful in capturing the nonterminal transitions (j?,, A):

tmp245-1963

Therefore,

tmp245-1964Look-ahead set in terms of FOLLOW sets.

Figure 7-13 Look-ahead set in terms of FOLLOW sets.

Throughout the discussion, we will exhibit various relations for the grammar whose rules are:

tmp245-1966

The LR(0) machine for this grammar was given in Fig. 7-10. The set of nonterminal transitions for this machine is

tmp245-1967

The relation LOOKBACK consists of the following ordered pairs:

tmp245-1968

The FOLLOW sets are interrelated. More specifically, it can be shown that

tmp245-1969

Pictorially, the situation is depicted in Fig. 7-14. For a given string 8 which accesses state p’, the string SA accesses state p. Furthermore, in a proper right context, 5A/4 is first reduced to 8\A6 since 6 =» t and then to SB. The inclusion relation states that those symbols that can follow B in state p’ can also follow A

Interrelationships among FOLLOW sets.

Figure 7-14 Interrelationships among FOLLOW sets.

in state p. This inclusion can be more easily specified by defining a new relation on nonterminal transitions.

tmp245-1971

Therefore,tmp245-1972

Note that X in the current discussion can be the empty string.

Returning to our example, the INCLUDES relation is the following set of ordered pairs:

tmp245-1974

The first ordered pair is due to the ruletmp245-1975The ruletmp245-1976 accounts for the generation of the next two ordered pairs. The productiontmp245-1977 is responsible for the last two ordered pairs (X is empty in this case).

We next define Read( p, A) to be the set of terminal symbols that can be read before phrases that contain A can be reduced. The following diagram illustrates this situation:

tmp245-1981

Observe that because of nullable nonterminals there can be several reductions before the next input symbol is read. If there are no empty productions, the case becomes

tmp245-1982

and Read(/>, A) are the symbols that can be directly read. The following theorem specifies how the FOLLOW sets are computed:

tmp245-1983

The set Read(/>, A) is computed in the following manner:

tmp245-1984

where DR and READS are defined as

tmp245-1985

The DR function specifies those symbols that can be directly read from the successor state t of the transition from p to ( on A. Symbols can be indirectly read when nullable nonterminals can follow the nonterminal A. For example, in

tmp245-1986

Therefore,tmp245-1987

Returning to the current example, the function DR is obtained easily from the LR(0) machine as follows:

tmp245-1989

Since the example grammar does not contain any empty productions, the READS relation is empty. Consequently,

tmp245-1990

The FOLLOW set computations for our example can now be performed.

tmp245-1991

Using the LOOKBACK relation, the LA sets can be finally computed. These computations are as follows:

Structural representation of the READS relation.

Figure 7-15 Structural representation of the READS relation.

tmp245-1993

Recall that state 2 in the machine of Fig. 7-10 is inadequate. However, since the lookahead sets for the two rules giving rise to a reduce-reduce conflict are disjoint, the inadequate state is resolved.

In summary, the computations are:

tmp245-1994

Using relations we have:

tmp245-1995

where INCLUDES* and READS* represent the reflexive transitive closures of the INCLUDES and READS relations, respectively, and

tmp245-1996

A general algorithm for computing the lookahead sets contains the following steps.

1. Compute the set of nullable nonterminals.

2. Obtain the DR relation from the LR(0) machine.

3. Compute the READS relation from the LR(0) machine.

4. Obtain the Read sets.

5. Compute the INCLUDES and LOOKBACK relations.

6. Obtain the FOLLOW sets.

7. Generate the LA sets by taking the union of the FOLLOW sets.

8. Check to see that all inadequate states are resolved; if so, an LALR( 1) parser can be obtained.

DeRemer and Pennello (1982) have formulated an efficient algorithm for computing the Read and FOLLOW sets in steps 4 and 6 of the previous algorithm, respectively. Their algorithm has an input a set X, a relation R, and a set-valued function F’; and it produces as output the set-valued function F as follows:

tmp245-1997

The relation R induces a directed graph G = (X, R), where X and R denote the set of nodes (or vertices) and edge set in the graph, respectively. F(x) is efficiently computed by traversing the graph G in an appropriate manner. The following algorithm computes the strongly connected components of the graph and generates the set-valued function F.

Procedure DIGRAPH(X, R, F’ F). Given X, R, and F’ as just described, this procedure generates the set-valued function F. STACK is a stack that contains elements of X and is initially empty. N is a vector of integers which is indexed by elements of X with each of the elements initialized to zero.

tmp245-1998

Procedure TRAVERSE(x). Given a vertex of a graph, this procedure computes the function F. Stack routines PUSH and POP are assumed. TOPV is a function that returns the top element of the stack. The function MIN returns the smallest of its two arguments. DEPTH is a function which returns the number of items on the stack. STACK, N, X, and R are global and are as previously described. The variables d, ELEMENT, and y are local variables.

tmp245-1999

As mentioned earlier, this algorithm is invoked twice in computing lookahead sets. In computing Read sets (step 4 of the general algorithm) X is ‘he set of nonterminal transitions, F’ is DR, and R is the READS relation. The output F is Read. In the second invocation of Procedure DIGRAPH, X is again the set of nonterminal transition, F’ is Read, and R is the relation INCLUDES. The procedure in this case generates the set-valued function FOLLOW. Procedure DIGRAPH can also check for cycles (due to nontrivial strongly connected components). If such a component is detected, the grammar may not be LALR(Y).

DeRemer and Pennello have performed a comparison of their lookahead generation technique with that of YACC. The result of this comparison indicates that their method is significantly more efficient than YACC’s.

Next post:

Previous post: