Bottom-Up Parsing (Compiler Writing) Part 8

LR GRAMMARS

This section deals with the construction of deterministic parsers for a class of grammars called LR(k) grammars. This class of grammars is essentially the set of all unambiguous context-free grammars and properly contains other previously encountered classes of grammars such as LL(k) and precedence grammars. An LR(k) parser scans a given input string from left to right and constructs, as was done in a simple precedence parser, the reverse of a rightmost derivation for that string. An LR(k) parser makes all parsing decisions based on the contents of a parse stack and the next k lookahead symbols in an input string. Besides being efficient and general, an LR(k) parser also possesses the "valid prefix property" (see Sec. 5-2.3). Recall that a parser having this property can detect an error in the input string at the earliest possible time. These parsers, however, are more complicated than some of the other kinds of parsers encountered thus far. LR(k) parsers are tedious to construct. For more than one symbol of lookahead (i.e., k > 1), the LR approach tends to be too expensive. Consequently, this section concentrates on the development of parsers that iise at most one lookahead symbol. An LR(0) parser does not require any lookahead symbols in order to deterministically parse a language. Because of this simplicity, and the fact that an unsuccessful attempt at its construction can be the starting point or first step in constructing other more general parsers, we discuss LR(0) parsers in detail. Other parsers based on one symbol of lookahead follow in order of increasing strength: SLR( 1), LALR(l), and LR( 1). We next present some storage and optimization considerations in building these parsers. The section concludes with a discussion of the aspects of error detection and recovery in LR parsers.


Concepts and Terminology

An LR parser constructs the reverse of a rightmost derivation for a given input string. Consider a grammar G whose starting symbol is S. Fof an input string x, its rightmost derivation is given by

tmp245-1295_thumb

where the rightmost nonterminal in each a„ for 1 < / < m, is the one selected to be rewritten (see Sec. 2-4.2). A representative step in this derivation would have the form

tmp245-1296_thumb

wheretmp245-1297_thumbis the production that has been applied. Observe that, since we are concerned with a rightmost derivation, t must be a string of terminal symbols.

A grammar is said to be LR(k) if, for any given input string, at each step of any derivation the handle /3 can be detected by examining the string <f>/? and scanning at most the first k symbols of the unexpended input string t.

We now define a term which is central to the discussion of LR parsers. A viable prefix of a sentential form <#>/?r, where /? denotes the handle, is any prefix or head string of <j>/i. That is, iftmp245-1298_thumb, a stringtmp245-1299_thumb, wheretmp245-1300_thumb is a viable prefix of the sentential formtmp245-1301_thumbNote that a viable prefix cannot contain symbols to the right of the handle (i.e., symbols in t).

As an example, consider the grammar with the following productions:

tmp245-1307_thumb

The production S -* E# is a padding production which ensures that S is the left part of only one production. This production is useful in determining when to halt in a parsing algorithm. Consider the sentential form .

Knuth (1965), in his important paper dealing with LR(k) grammars, showed that the set of all viable prefixes of all rightmost derivations associated with a particular grammar can be recognized by a finite-state machine (see Chap. 4). This machine becomes the finite-control part of a parser called an LR parser. Observe that, in general, a useful grammar has an infinite number of associated viable prefixes. An important aspect in obtaining an LR parser is to construct (automatically) the finite-state machine directly from the productions of the grammar. The question that arises is: How can we accomplish this from an infinite number of viable prefixes? We will return to this question in the next subsection.

In the remainder of this subsection we assume that a finite control for a parser already exists. The set of all viable prefixes associated with a grammar is partitioned among the states in the finite-state machine that recognize these viable prefixes. Unless otherwise stated, we concentrate on LR parsers that need at most one symbol lookahead in the input string, that is, LR(0) and LR( 1) parsers.

An LR parser, like most other types of parsers, is a pushdown machine that has, as input, a given string; a stack, and a finite-control mechanism. This

tmp245-1313_thumbtmp245-1314_thumb

mechanism is a finite-state machine with, in general, many states. Figure 7-7 contains a finite control for the example grammar given earlier. This machine contains 12 states, with state 0 as its initial or start state. Two types of states are present: read states and apply or reduce states. A read state causes a transition from one state to another to occur on reading a terminal or nonterminal symbol. States 0, 1, 4, 6, 7, and 10 are all read states. A reduce or apply state recognizes the handle part of a viable prefix associated with the current sentential form. The result of being in such a reduce state is to reduce the detected handle to a nonterminal symbol. As mentioned earlier, a state can be associated with an infinite number of viable prefixes. For example, in Fig. 7-7 state 4 can be reached from state 0 with the viable prefixes:tmp245-1315_thumb

Similarly, any viable prefix whose tail symbol is i will reach state 3. Similar observations can be made about other states in the finite-state machine. Note that the machine in Fig. 7-7 is deterministic.

An LR(0) parser maintains a stack of symbol pairs. The first element in each pair is a symbol of the vocabulary. The second element in the pair denotes the state entered on scanning a vocabulary symbol. The stack initially contains state 0, the start state of the finite control associated with the parser. When the parser is in a read state, it stacks the current symbol being scanned and performs the indicated transition based on the symbol read. This subsequently entered state is also stacked. When the parser enters an apply or reduce state which is associated with a production B -* /?, it pops the top 2* |/?| symbols off the stack. If the goal symbol is reached at this point (i.e., B = S) the parser accepts the given input string and halts. Otherwise, the state which appears at the top of the stack after the reduction and the left part of the rule that applies determine the next transition state. We then stack both B and this next state. Note that since the finite control associated with the parser is deterministic, the grammar symbols need not be stacked.

Parser for an LR(0) grammar.

Figure 7-7 Parser for an LR(0) grammar.

We do so only to make the discussion more understandable. Also observe that on reaching a reduce or apply state, the entire handle is known at that point. Unlike precedence parsers, there is no need to enter the stack to find the head symbol of the handle.

A trace of the informal parsing strategy for the given input string, i -(/ 4- /)#, appears in Table 7-31. Initially, the stack contains state 0 and the current input symbol is i. The finite control causes a transition from read state 0 to state 3 on reading an i. The symbol pair, i and state 3, is stacked at this point. Since state 3 is a reduce state and the production T -» / applies, the symbol pair at the top of the stack is discarded. This deletion operation brings state 0 to the top of the stack. Based on state 0 and the nonterminal T, the next transition is to state 2. We now stack the symbol pair, T and state 2. The stack content becomes 0 T2 Observe that one step in the reverse of the rightmost derivation has been obtained. That is,

tmp245-1318_thumb

Continuing the trace, we are now in state 2, which is another reduce state. At this time, production E -* T applies. Consequently, the top symbol pair in the stack, T and state 2, is discarded. State 0 again appears at the top of the stack. This state and E causes a transition to state 1. Therefore, the symbol pair, E and state 1, is pushed on the stack, thus giving the following stack contents:

tmp245-1392_thumb

Table 7-31 Trace of LR(0) parser for the string tmp245-1319_thumb

Step

Stack contents

Input string

Viable prefix

Handle

0

tmp245-1320 tmp245-1321 tmp245-1322 tmp245-1323

1

tmp245-1324 tmp245-1325 tmp245-1326 tmp245-1327

2

tmp245-1328 tmp245-1329 tmp245-1330 tmp245-1331

3

tmp245-1332 tmp245-1333 tmp245-1334 tmp245-1335

4

tmp245-1336 tmp245-1337 tmp245-1338 tmp245-1339

5

tmp245-1340 tmp245-1341 tmp245-1342 tmp245-1343

6

tmp245-1344 tmp245-1345 tmp245-1346 tmp245-1347

7

tmp245-1348 tmp245-1349 tmp245-1350 tmp245-1351

8

tmp245-1352 tmp245-1353 tmp245-1354 tmp245-1355

9

tmp245-1356 tmp245-1357 tmp245-1358 tmp245-1359

10

tmp245-1360 tmp245-1361 tmp245-1362 tmp245-1363

11

tmp245-1364 tmp245-1365 tmp245-1366 tmp245-1367

12

tmp245-1368 tmp245-1369 tmp245-1370 tmp245-1371

13

tmp245-1372 tmp245-1373 tmp245-1374 tmp245-1375

14

tmp245-1376 tmp245-1377 tmp245-1378 tmp245-1379

15

tmp245-1380 tmp245-1381 tmp245-1382 tmp245-1383

16

tmp245-1384 tmp245-1385 tmp245-1386 tmp245-1387

17

tmp245-1388 tmp245-1389 tmp245-1390 tmp245-1391

The next step in the reverse of the rightmost derivation is complete. That is,

tmp245-1393_thumb

The next transition is from state 1 to state 7 because the current input symbol is — The minus sign and state 7 is the next symbol pair to be stacked. The stack content becomes

tmp245-1394_thumb

The new current input symbol is (. Based on this symbol, a state transition occurs from state 7 to state 4 with symbol pair ( and state 7 being pushed on the stack, giving a new stack content of

tmp245-1395_thumb

The next input symbol is now /’. A transition from state 4 to state 3 occurs on i. The pushing of the new symbol pair, / and state 3, on the stack yields

tmp245-1396_thumb

Since state 3 is a reduce state, the rule T -* i applies. The top symbol pair on the stack is discarded. A transition from state 4 to state 2 then occurs on T. The new symbol pair, T and state 2, is pushed on the stack, with its contents now becoming

tmp245-1397_thumb

The new step in the parse that has been obtained is

tmp245-1398_thumb

The remaining steps of the trace can be obtained in an analogous manner. On reaching state 5, an apply state, the starting symbol of the grammar (S) is recognized after performing the indicated reduction. At this point the parser accepts the given input string and halts.

Note that LR parsers use a somewhat different approach to parsing than do precedence parsers. Both types of parsers detect the tail of a handle by looking at the top entry of the parse stack and the current input symbol. Precedence parsers, with the use of precedence relations, must search the stack for the head of the handle. LR parsers, however, make that decision based on the top entry of the stack and the current input symbol. The parse stack in an LR parser is more complex than the corresponding parse stack in a precedence parser. The top entry of the stack in any LR parser essentially summarizes all the information about the parse up to that point in the parsing process.

The existence of the finite control for the ZJ?(0) parser just discussed was assumed. The next subsection deals with the construction of such a finite control from a given grammar.

LR(0) Parsers

This subsection describes how to construct, from the productions of a given LR(0) grammar, a parser for that grammar. Although few useful grammars are LR(0), the construction of LR(0) parsers can be the basis of construction for more complicated parsers such as the class of SLR(l) parsers. LR(0) parsers require no lookahead from the unexpended input string in order to make all parsing decisions.

Recall from the last subsection that the finite control associated with an LR(0) parser consisted of a finite number of states. Each state was associated with the set of viable prefixes that was recognized by that state. Since the number of viable prefixes associated with a particular state is often infinite, it is difficult to use the sets of viable prefixes to construct the finite control of a parser. Although the notion of a viable prefix is intuitively useful in talking about ah LR parser, it is of little direct use in obtaining the parser itself. In order to determine the states of the finite control for a parser, we want to associate each state with a finite set. Such a finite set can be defined by using the notion of a viable prefix.

Although an LR parser recognizes viable prefixes, its principal goal is to output the sequence of productions that was used in constructing the reverse rightmost derivation for a given input string. From this point of view, the parser wants to detect handles. In order to detect a handle (the right part of a production), however, a parser must first recognize certain parts or prefixes of that handle. This observation and the previous comment on infinite sets of viable prefixes lead us to the following definition.

Definition 7-7 An item or a configuration of a given grammar G is a marked production of the form

tmp245-1403_thumb

where

tmp245-1404_thumb

is a production of G and the period or dot denotes the mark.

The brackets help distinguish an item from its associated production. In general, there are several items or configurations associated with the same > production. For example, the productiontmp245-1405_thumbhas the following four associated items:

tmp245-1407_thumb

In order to construct the finite control of an LR parser, we want to associate items with viable prefixes. Therefore, an itemtmp245-1408_thumb"is valid for some viable prefix </>«! if and only if there exists some rightmost derivation

tmp245-1410_thumb

in which t is a string of terminal symbols. In general, there may be many valid items for a viable prefix. For example, in the LR(0) grammar of the previous subsection which contained the productions

tmp245-1411_thumb

the items

tmp245-1412_thumb

are all valid for the viable prefix E – . The rightmost derivation

tmp245-1413_thumb

shows that the item „tmp245-1414_thumb, is valid for E – . In this case, by using the previous definition of valid item, we havetmp245-1415_thumb Similarly, the rightmost derivation

tmp245-1418_thumb

shows thattmp245-1419_thumbis also valid. In this case,tmp245-1420_thumband tmp245-1421_thumbFinally, the rightmost derivation

tmp245-1425_thumb

demonstrates that the itemtmp245-1426_thumbis valid. These observations can also be mverified by examining the finite control of the parser given in Fig. 7-7. Since a grammar has a finite number of productions, it follows that the number of valid items associated with a viable prefix will also be finite.

We want to associate a finite set of valid items with each state in a parser’s finite-control mechanism. Since each viable prefix has a finite number of valid items, it follows that a potentially infinite set of viable prefixes (representing one state of the parser) will also have only a finite number of associated valid items. Consequently, if we can obtain the finite set of valid items associated with a state of the parser, the finite-state control of the parser can be derived. This is the approach that we will follow. The machine of Fig. 7-7 will be constructed using the notion of a valid item.

We now, in an informal manner, proceed to construct the sets of items associated with an LR(0) parser. To get the construction process started, the item tmp245-1427_thumbis associated with the start state of the finite-state control, where S denotes the distinguished symbol of the grammar. The itemtmp245-1428_thumbreflects the fact that the parser has not recognized any (nonempty) viable prefixes. For the current example, state 0 is associated with the itemtmp245-1429_thumb

A closure or completion operation is required on the initial item in the item set. To obtain the closure of an itemtmp245-1430_thumbwhere X is a nonterminal, we include in the item set all items of the formtmp245-1431_thumbIn the example, the closure of the itemtmp245-1432_thumbfirst generates the itemstmp245-1433_thumband tmp245-1434_thumbNote that using the previous notation,tmp245-1435_thumb andtmp245-1436_thumbApplying the closure operation to the itemtmp245-1437_thumbyields two more items:tmp245-1438_thumbRepeating the closure operation on any of the items obtained so far does not produce any new items. The initial item set C0 associated with the start state of the parser is

tmp245-1452_thumb

Intuitively, the parser in state 0 expects to encounter a string derivable from £#. Strings derivable from E are those which are derivable from either T, E + T, or E — T. The strings derivable from T will begin with either an /’ or a (. The first item in C0 is the basis item. The remaining items in C0 are the result of performing the closure or completion operation.

Thus far, the machine does not recognize symbols in a viable prefix. To have a machine recognize symbols, we need to define a new operation called a read or successor operation. Let an item of the formtmp245-1453_thumbbe associated with some state U, where X denotes any symbol in the grammar vocabulary. A read or successor operation associates the itemtmp245-1454_thumbwith some state V in the machine. The read operation defines a transition from state U to state V on scanning the symbol X. Recall from the previous subsection that transitions in the finite control of an LR parser occur when an input symbol is scanned or the parser is in a reduce state. The read operation is instrumental in the creation of new states in the construction process.

In the current example, a read operation on the nonterminal E results in the creation of three new items associated with a new state, say, state 1. These items, which form the basis set of the current state, are the following:

tmp245-1457_thumb

The completion or closure of these items yields no new items, since in each case a terminal symbol follows the dot. Let C\ denote the new item set.

Similarly, a read operation or transition from state 0 to state 2 on nonterminal T yields a basis itemtmp245-1458_thumbThe closure of this item also yields no new items. We call the item set C2. A similar situation occurs when performing a transition from state 0 to state 3 on the terminal /’, thus yielding the itemtmp245-1459_thumb C3 denotes this new item set. A fourth (and final) transition from state 0 to state 4 occurs on reading a left parenthesis. The basis set of state 4 is the item tmp245-1460_thumbThe closure operation on this set, however, generates the following new items:

tmp245-1464_thumb

The six items form the item set C4. The portion of the machine constructed thus far is illustrated in Fig. 7-8.

Intuitively, state 1 has recognized an E and the next symbol expected is #, +, or -. State 3 recognizes a handle (/’) and is a reduce state. Similarly, state 2 is a reduce state which recognizes a handle (T). Finally, state 4 reflects the fact that a left parenthesis was scanned. The next symbols to be encountered are those derivable from an expression (E). The symbols derivable from an expression are those derivable from a term (T).

Continuing the machine-generation process, we can perform a transition from state 1 to state 5 on reading the symbol #. The item set for state 5 is tmp245-1465_thumband denotes another reduce state. Generating a new basis set of items through a read operation and then performing a completion operation on that set yields another collection of items associated with some perhaps new state.

Partial construction of an LR(0) machine.

Figure 7-8 Partial construction of an LR(0) machine.

Observe at this point that it is possible to generate a basis set of items which has already been associated with a previously generated state. If such a "duplicate" state occurs, it is simply ignored. The process of creating a new state by performing a read (or transition) operation and subsequent completions of the basis items must terminate. Since each set of items is finite, the machine must have a finite number of states. Continuing the state-generation process just described for the current grammar yields the machine given earlier in Fig. 7-7. The sets of items associated with each state of the machine are summarized in Table 7-32.

Before giving an algorithm for constructing the finite control of an LR(0) parser, we summarize the operations required for its construction. We want to generate the item sets associated with each state of the parser.

Table 7-32 The items sets for an LR(0) parser

State 0:

tmp245-1468

State 5:

tmp245-1469
tmp245-1470

State 6:

tmp245-1471
tmp245-1472 tmp245-1473
tmp245-1474 tmp245-1475
tmp245-1476

State 7:

tmp245-1477
tmp245-1478 tmp245-1479

State 1:

tmp245-1480 tmp245-1481
tmp245-1482

State 8:

tmp245-1483
tmp245-1484

State 9:

tmp245-1485

State 2:

tmp245-1486

State 10:

tmp245-1487

State 3:

tmp245-1488 tmp245-1489

State 4:

tmp245-1490 tmp245-1491
tmp245-1492

State 11:

tmp245-1493
tmp245-1494 tmp245-1495
tmp245-1496 tmp245-1497
tmp245-1498 tmp245-1499
tmp245-1500 tmp245-1501

The three following rules are used to construct the finite-state control of the parser:

1. Start operation: Lettmp245-1502_thumbbe a rule in grammar G where S denotes the starting symbol of the grammar. Then the itemtmp245-1503_thumbis associated with the start state. This operation gets the construction process started: The start state, in general, will eventually contain several items.

2. The closure (or completion) operation: Iftmp245-1504_thumbis an item associated with a state U, where X is a nonterminal symbol, then each item of the formtmp245-1505_thumbmust also be associated with state U. This operation is applied repeatedly until no more items are associated with that state.

3. The read (or successor) operation: Let X be a vocabulary symbol in an itemtmp245-1506_thumbassociated with some state U. Then the itemtmp245-1507_thumbis associated with a transition to state V on reading symbol X. Note that U and V can be the same state (see state 4 in Fig. 7-7).

These operations are used in the following algorithm, which generates the finite control for an LR(0) parser.

Algorithm LR(0)_MACHINE. Given a grammar G, with starting symbol S, this algorithm generates the finite control of the LR(0) parser for the given grammar. The machine contains the following sets of items:

tmp245-1514_thumb

where C0 is the initial or start item set. The states of the LR(0) parser are

tmp245-1515_thumb

where each state j is constructed or obtained from the item set Cr The algorithm generates the collection of items C.

1. [Generate the starting item set]

(a) Assign the starting item set a subscript (say, 0) and then place the itemtmp245-1516_thumbin the set C0.

(b) Perform the completion operation on the item, i.e., look for a nonterminal symbol X which follows the dot ".", and include the items of the formtmp245-1517_thumbin the set wheretmp245-1518_thumbis a production in G. The completion operation is also performed on all new items derived.

(c) Call the set of items obtained in parts a and b the set C0.

2. [Generate the sets of items C]

Repeat through step 4 until no more distinct item sets occur.

3. [Perform a read operation on an item]

(a) Perform a read operation on an item (initially, in set C0). Start a new state set, say, Cr

(b) If this item set is already there, ignore it. At any rate, obtain a new basis item set or exit.

4. [Perform the completion of the new state set]

Perform the completion operation on the basis set of items for Cr

The algorithm generates the item sets from which a machine, such as that given in Fig. 7-7, can be derived. As mentioned earlier, there are two kinds of states in an LR(0) machine, read states and reduce (or apply) states. A reduce state is associated with a completed item, i.e., one in which the dot is at the right end of the item. Such a state indicates that a handle has been detected. An LR(0) parser with a finite control can be described by a pair of tables, one table to indicate the action taken by the parser such as accept, reduce, push, and invalid, and the other to indicate the transition from one state to another. We return to the specifics of these tables later.

An LR(0) grammar (see exercises) is clearly not sufficiently powerful to express the constructs contained in current programming languages. These languages require at least one symbol lookahead in order to be parsed deterministi-cally. The generation of an LR (0) machine, however, can be a starting point for obtaining an LR parser based on one symbol of lookahead. We develop such a simple class of parsers in the next subsection.

Next post:

Previous post: