LR GRAMMARS
This section deals with the construction of deterministic parsers for a class of grammars called LR(k) grammars. This class of grammars is essentially the set of all unambiguous context-free grammars and properly contains other previously encountered classes of grammars such as LL(k) and precedence grammars. An LR(k) parser scans a given input string from left to right and constructs, as was done in a simple precedence parser, the reverse of a rightmost derivation for that string. An LR(k) parser makes all parsing decisions based on the contents of a parse stack and the next k lookahead symbols in an input string. Besides being efficient and general, an LR(k) parser also possesses the "valid prefix property" (see Sec. 5-2.3). Recall that a parser having this property can detect an error in the input string at the earliest possible time. These parsers, however, are more complicated than some of the other kinds of parsers encountered thus far. LR(k) parsers are tedious to construct. For more than one symbol of lookahead (i.e., k > 1), the LR approach tends to be too expensive. Consequently, this section concentrates on the development of parsers that iise at most one lookahead symbol. An LR(0) parser does not require any lookahead symbols in order to deterministically parse a language. Because of this simplicity, and the fact that an unsuccessful attempt at its construction can be the starting point or first step in constructing other more general parsers, we discuss LR(0) parsers in detail. Other parsers based on one symbol of lookahead follow in order of increasing strength: SLR( 1), LALR(l), and LR( 1). We next present some storage and optimization considerations in building these parsers. The section concludes with a discussion of the aspects of error detection and recovery in LR parsers.
Concepts and Terminology
An LR parser constructs the reverse of a rightmost derivation for a given input string. Consider a grammar G whose starting symbol is S. Fof an input string x, its rightmost derivation is given by
where the rightmost nonterminal in each a„ for 1 < / < m, is the one selected to be rewritten (see Sec. 2-4.2). A representative step in this derivation would have the form
whereis the production that has been applied. Observe that, since we are concerned with a rightmost derivation, t must be a string of terminal symbols.
A grammar is said to be LR(k) if, for any given input string, at each step of any derivation the handle /3 can be detected by examining the string <f>/? and scanning at most the first k symbols of the unexpended input string t.
We now define a term which is central to the discussion of LR parsers. A viable prefix of a sentential form <#>/?r, where /? denotes the handle, is any prefix or head string of <j>/i. That is, if, a string, where is a viable prefix of the sentential formNote that a viable prefix cannot contain symbols to the right of the handle (i.e., symbols in t).
As an example, consider the grammar with the following productions:
The production S -* E# is a padding production which ensures that S is the left part of only one production. This production is useful in determining when to halt in a parsing algorithm. Consider the sentential form .
Knuth (1965), in his important paper dealing with LR(k) grammars, showed that the set of all viable prefixes of all rightmost derivations associated with a particular grammar can be recognized by a finite-state machine (see Chap. 4). This machine becomes the finite-control part of a parser called an LR parser. Observe that, in general, a useful grammar has an infinite number of associated viable prefixes. An important aspect in obtaining an LR parser is to construct (automatically) the finite-state machine directly from the productions of the grammar. The question that arises is: How can we accomplish this from an infinite number of viable prefixes? We will return to this question in the next subsection.
In the remainder of this subsection we assume that a finite control for a parser already exists. The set of all viable prefixes associated with a grammar is partitioned among the states in the finite-state machine that recognize these viable prefixes. Unless otherwise stated, we concentrate on LR parsers that need at most one symbol lookahead in the input string, that is, LR(0) and LR( 1) parsers.
An LR parser, like most other types of parsers, is a pushdown machine that has, as input, a given string; a stack, and a finite-control mechanism. This
mechanism is a finite-state machine with, in general, many states. Figure 7-7 contains a finite control for the example grammar given earlier. This machine contains 12 states, with state 0 as its initial or start state. Two types of states are present: read states and apply or reduce states. A read state causes a transition from one state to another to occur on reading a terminal or nonterminal symbol. States 0, 1, 4, 6, 7, and 10 are all read states. A reduce or apply state recognizes the handle part of a viable prefix associated with the current sentential form. The result of being in such a reduce state is to reduce the detected handle to a nonterminal symbol. As mentioned earlier, a state can be associated with an infinite number of viable prefixes. For example, in Fig. 7-7 state 4 can be reached from state 0 with the viable prefixes:
Similarly, any viable prefix whose tail symbol is i will reach state 3. Similar observations can be made about other states in the finite-state machine. Note that the machine in Fig. 7-7 is deterministic.
An LR(0) parser maintains a stack of symbol pairs. The first element in each pair is a symbol of the vocabulary. The second element in the pair denotes the state entered on scanning a vocabulary symbol. The stack initially contains state 0, the start state of the finite control associated with the parser. When the parser is in a read state, it stacks the current symbol being scanned and performs the indicated transition based on the symbol read. This subsequently entered state is also stacked. When the parser enters an apply or reduce state which is associated with a production B -* /?, it pops the top 2* |/?| symbols off the stack. If the goal symbol is reached at this point (i.e., B = S) the parser accepts the given input string and halts. Otherwise, the state which appears at the top of the stack after the reduction and the left part of the rule that applies determine the next transition state. We then stack both B and this next state. Note that since the finite control associated with the parser is deterministic, the grammar symbols need not be stacked.
Figure 7-7 Parser for an LR(0) grammar.
We do so only to make the discussion more understandable. Also observe that on reaching a reduce or apply state, the entire handle is known at that point. Unlike precedence parsers, there is no need to enter the stack to find the head symbol of the handle.
A trace of the informal parsing strategy for the given input string, i -(/ 4- /)#, appears in Table 7-31. Initially, the stack contains state 0 and the current input symbol is i. The finite control causes a transition from read state 0 to state 3 on reading an i. The symbol pair, i and state 3, is stacked at this point. Since state 3 is a reduce state and the production T -» / applies, the symbol pair at the top of the stack is discarded. This deletion operation brings state 0 to the top of the stack. Based on state 0 and the nonterminal T, the next transition is to state 2. We now stack the symbol pair, T and state 2. The stack content becomes 0 T2 Observe that one step in the reverse of the rightmost derivation has been obtained. That is,
Continuing the trace, we are now in state 2, which is another reduce state. At this time, production E -* T applies. Consequently, the top symbol pair in the stack, T and state 2, is discarded. State 0 again appears at the top of the stack. This state and E causes a transition to state 1. Therefore, the symbol pair, E and state 1, is pushed on the stack, thus giving the following stack contents:
Table 7-31 Trace of LR(0) parser for the string
Step |
Stack contents |
Input string |
Viable prefix |
Handle |
0 |
||||
1 |
||||
2 |
||||
3 |
||||
4 |
||||
5 |
||||
6 |
||||
7 |
||||
8 |
||||
9 |
||||
10 |
||||
11 |
||||
12 |
||||
13 |
||||
14 |
||||
15 |
||||
16 |
||||
17 |
The next step in the reverse of the rightmost derivation is complete. That is,
The next transition is from state 1 to state 7 because the current input symbol is — The minus sign and state 7 is the next symbol pair to be stacked. The stack content becomes
The new current input symbol is (. Based on this symbol, a state transition occurs from state 7 to state 4 with symbol pair ( and state 7 being pushed on the stack, giving a new stack content of
The next input symbol is now /’. A transition from state 4 to state 3 occurs on i. The pushing of the new symbol pair, / and state 3, on the stack yields
Since state 3 is a reduce state, the rule T -* i applies. The top symbol pair on the stack is discarded. A transition from state 4 to state 2 then occurs on T. The new symbol pair, T and state 2, is pushed on the stack, with its contents now becoming
The new step in the parse that has been obtained is
The remaining steps of the trace can be obtained in an analogous manner. On reaching state 5, an apply state, the starting symbol of the grammar (S) is recognized after performing the indicated reduction. At this point the parser accepts the given input string and halts.
Note that LR parsers use a somewhat different approach to parsing than do precedence parsers. Both types of parsers detect the tail of a handle by looking at the top entry of the parse stack and the current input symbol. Precedence parsers, with the use of precedence relations, must search the stack for the head of the handle. LR parsers, however, make that decision based on the top entry of the stack and the current input symbol. The parse stack in an LR parser is more complex than the corresponding parse stack in a precedence parser. The top entry of the stack in any LR parser essentially summarizes all the information about the parse up to that point in the parsing process.
The existence of the finite control for the ZJ?(0) parser just discussed was assumed. The next subsection deals with the construction of such a finite control from a given grammar.
LR(0) Parsers
This subsection describes how to construct, from the productions of a given LR(0) grammar, a parser for that grammar. Although few useful grammars are LR(0), the construction of LR(0) parsers can be the basis of construction for more complicated parsers such as the class of SLR(l) parsers. LR(0) parsers require no lookahead from the unexpended input string in order to make all parsing decisions.
Recall from the last subsection that the finite control associated with an LR(0) parser consisted of a finite number of states. Each state was associated with the set of viable prefixes that was recognized by that state. Since the number of viable prefixes associated with a particular state is often infinite, it is difficult to use the sets of viable prefixes to construct the finite control of a parser. Although the notion of a viable prefix is intuitively useful in talking about ah LR parser, it is of little direct use in obtaining the parser itself. In order to determine the states of the finite control for a parser, we want to associate each state with a finite set. Such a finite set can be defined by using the notion of a viable prefix.
Although an LR parser recognizes viable prefixes, its principal goal is to output the sequence of productions that was used in constructing the reverse rightmost derivation for a given input string. From this point of view, the parser wants to detect handles. In order to detect a handle (the right part of a production), however, a parser must first recognize certain parts or prefixes of that handle. This observation and the previous comment on infinite sets of viable prefixes lead us to the following definition.
Definition 7-7 An item or a configuration of a given grammar G is a marked production of the form
where
is a production of G and the period or dot denotes the mark.
The brackets help distinguish an item from its associated production. In general, there are several items or configurations associated with the same > production. For example, the productionhas the following four associated items:
In order to construct the finite control of an LR parser, we want to associate items with viable prefixes. Therefore, an item"is valid for some viable prefix </>«! if and only if there exists some rightmost derivation
in which t is a string of terminal symbols. In general, there may be many valid items for a viable prefix. For example, in the LR(0) grammar of the previous subsection which contained the productions
the items
are all valid for the viable prefix E – . The rightmost derivation
shows that the item „, is valid for E – . In this case, by using the previous definition of valid item, we have Similarly, the rightmost derivation
shows thatis also valid. In this case,and Finally, the rightmost derivation
demonstrates that the itemis valid. These observations can also be mverified by examining the finite control of the parser given in Fig. 7-7. Since a grammar has a finite number of productions, it follows that the number of valid items associated with a viable prefix will also be finite.
We want to associate a finite set of valid items with each state in a parser’s finite-control mechanism. Since each viable prefix has a finite number of valid items, it follows that a potentially infinite set of viable prefixes (representing one state of the parser) will also have only a finite number of associated valid items. Consequently, if we can obtain the finite set of valid items associated with a state of the parser, the finite-state control of the parser can be derived. This is the approach that we will follow. The machine of Fig. 7-7 will be constructed using the notion of a valid item.
We now, in an informal manner, proceed to construct the sets of items associated with an LR(0) parser. To get the construction process started, the item is associated with the start state of the finite-state control, where S denotes the distinguished symbol of the grammar. The itemreflects the fact that the parser has not recognized any (nonempty) viable prefixes. For the current example, state 0 is associated with the item
A closure or completion operation is required on the initial item in the item set. To obtain the closure of an itemwhere X is a nonterminal, we include in the item set all items of the formIn the example, the closure of the itemfirst generates the itemsand Note that using the previous notation, andApplying the closure operation to the itemyields two more items:Repeating the closure operation on any of the items obtained so far does not produce any new items. The initial item set C0 associated with the start state of the parser is
Intuitively, the parser in state 0 expects to encounter a string derivable from £#. Strings derivable from E are those which are derivable from either T, E + T, or E — T. The strings derivable from T will begin with either an /’ or a (. The first item in C0 is the basis item. The remaining items in C0 are the result of performing the closure or completion operation.
Thus far, the machine does not recognize symbols in a viable prefix. To have a machine recognize symbols, we need to define a new operation called a read or successor operation. Let an item of the formbe associated with some state U, where X denotes any symbol in the grammar vocabulary. A read or successor operation associates the itemwith some state V in the machine. The read operation defines a transition from state U to state V on scanning the symbol X. Recall from the previous subsection that transitions in the finite control of an LR parser occur when an input symbol is scanned or the parser is in a reduce state. The read operation is instrumental in the creation of new states in the construction process.
In the current example, a read operation on the nonterminal E results in the creation of three new items associated with a new state, say, state 1. These items, which form the basis set of the current state, are the following:
The completion or closure of these items yields no new items, since in each case a terminal symbol follows the dot. Let C\ denote the new item set.
Similarly, a read operation or transition from state 0 to state 2 on nonterminal T yields a basis itemThe closure of this item also yields no new items. We call the item set C2. A similar situation occurs when performing a transition from state 0 to state 3 on the terminal /’, thus yielding the item C3 denotes this new item set. A fourth (and final) transition from state 0 to state 4 occurs on reading a left parenthesis. The basis set of state 4 is the item The closure operation on this set, however, generates the following new items:
The six items form the item set C4. The portion of the machine constructed thus far is illustrated in Fig. 7-8.
Intuitively, state 1 has recognized an E and the next symbol expected is #, +, or -. State 3 recognizes a handle (/’) and is a reduce state. Similarly, state 2 is a reduce state which recognizes a handle (T). Finally, state 4 reflects the fact that a left parenthesis was scanned. The next symbols to be encountered are those derivable from an expression (E). The symbols derivable from an expression are those derivable from a term (T).
Continuing the machine-generation process, we can perform a transition from state 1 to state 5 on reading the symbol #. The item set for state 5 is and denotes another reduce state. Generating a new basis set of items through a read operation and then performing a completion operation on that set yields another collection of items associated with some perhaps new state.
Figure 7-8 Partial construction of an LR(0) machine.
Observe at this point that it is possible to generate a basis set of items which has already been associated with a previously generated state. If such a "duplicate" state occurs, it is simply ignored. The process of creating a new state by performing a read (or transition) operation and subsequent completions of the basis items must terminate. Since each set of items is finite, the machine must have a finite number of states. Continuing the state-generation process just described for the current grammar yields the machine given earlier in Fig. 7-7. The sets of items associated with each state of the machine are summarized in Table 7-32.
Before giving an algorithm for constructing the finite control of an LR(0) parser, we summarize the operations required for its construction. We want to generate the item sets associated with each state of the parser.
Table 7-32 The items sets for an LR(0) parser
State 0: |
State 5: |
||
State 6: |
|||
State 7: |
|||
State 1: |
|||
State 8: |
|||
State 9: |
|||
State 2: |
State 10: |
||
State 3: |
|||
State 4: |
|||
State 11: |
|||
The three following rules are used to construct the finite-state control of the parser:
1. Start operation: Letbe a rule in grammar G where S denotes the starting symbol of the grammar. Then the itemis associated with the start state. This operation gets the construction process started: The start state, in general, will eventually contain several items.
2. The closure (or completion) operation: Ifis an item associated with a state U, where X is a nonterminal symbol, then each item of the formmust also be associated with state U. This operation is applied repeatedly until no more items are associated with that state.
3. The read (or successor) operation: Let X be a vocabulary symbol in an itemassociated with some state U. Then the itemis associated with a transition to state V on reading symbol X. Note that U and V can be the same state (see state 4 in Fig. 7-7).
These operations are used in the following algorithm, which generates the finite control for an LR(0) parser.
Algorithm LR(0)_MACHINE. Given a grammar G, with starting symbol S, this algorithm generates the finite control of the LR(0) parser for the given grammar. The machine contains the following sets of items:
where C0 is the initial or start item set. The states of the LR(0) parser are
where each state j is constructed or obtained from the item set Cr The algorithm generates the collection of items C.
1. [Generate the starting item set]
(a) Assign the starting item set a subscript (say, 0) and then place the itemin the set C0.
(b) Perform the completion operation on the item, i.e., look for a nonterminal symbol X which follows the dot ".", and include the items of the formin the set whereis a production in G. The completion operation is also performed on all new items derived.
(c) Call the set of items obtained in parts a and b the set C0.
2. [Generate the sets of items C]
Repeat through step 4 until no more distinct item sets occur.
3. [Perform a read operation on an item]
(a) Perform a read operation on an item (initially, in set C0). Start a new state set, say, Cr
(b) If this item set is already there, ignore it. At any rate, obtain a new basis item set or exit.
4. [Perform the completion of the new state set]
Perform the completion operation on the basis set of items for Cr
The algorithm generates the item sets from which a machine, such as that given in Fig. 7-7, can be derived. As mentioned earlier, there are two kinds of states in an LR(0) machine, read states and reduce (or apply) states. A reduce state is associated with a completed item, i.e., one in which the dot is at the right end of the item. Such a state indicates that a handle has been detected. An LR(0) parser with a finite control can be described by a pair of tables, one table to indicate the action taken by the parser such as accept, reduce, push, and invalid, and the other to indicate the transition from one state to another. We return to the specifics of these tables later.
An LR(0) grammar (see exercises) is clearly not sufficiently powerful to express the constructs contained in current programming languages. These languages require at least one symbol lookahead in order to be parsed deterministi-cally. The generation of an LR (0) machine, however, can be a starting point for obtaining an LR parser based on one symbol of lookahead. We develop such a simple class of parsers in the next subsection.