Bottom-Up Parsing (Compiler Writing) Part 1

A second class of parsing techniques builds parse tree from the leaves toward its root, i.e., in a bottom-up manner. Several bottom-up parsing techniques with no backup are presented in this topic. All the parsing techniques except the first are based on formal syntactic notions. Several formal classes of grammars have been created. For a given grammar in a particular class, there corresponds a constructor algorithm that generates parsing tables for that grammar. Each class of grammars has an associated generalized parsing algorithm that uses the automatically generated parsing tables for the given grammar to parse strings in its associated language. Error-detection and -correction techniques for these parsing techniques have, to a significant degree, been automated.

The topic begins with a description of one of the earliest informal parsing techniques, which is based on Polish notation. Section 7-2 introduces the class of operator precedence grammars. The notions used in parsing operator precedence grammars are extended to define the class of precedence grammars in Sec. 7-3. Section 7-4 gives a detailed description of the classes of SLR(l), LALR(l), and LR(1) grammars and their associated parsers. The topic concludes with a comparison of the parsing techniques.

All the formal parsing techniques (except for extended-precedence parsing) use one symbol lookahead in order to deterministically parse input strings. Although some techniques can be extended to use more than one symbol of lookahead, the storage requirements of the expanded parsing tables tend to make such extensions too costly. Furthermore, most programming language constructs can be parsed with one symbol of lookahead.


POLISH EXPRESSIONS AND THEIR COMPILATION

In this section we are concerned with the compilation of infix arithmetic expressions. We shall find it to be more efficient to evaluate an infix expression by first converting it to a suffix (or prefix) expression and then evaluating the latter. By using this approach, we can eliminate the repeated scanning of an infix expression in order to obtain its value.

The first subsection introduces the notions of suffix (Polish) notation. The equivalence between infix and suffix expressions is informally described. The second subsection formalizes the previously introduced notions. A general algorithm for converting infix expressions to a suffix form is formulated. The remainder of the section describes syntactic error-detection techniques for suffix (or prefix) expressions.

Polish Notation

In this subsection we introduce the notation for Polish expressions. Although the discussion is confined to expressions, the theory can be expanded to many other language constructs. Such extensions will be discussed in Chap. 10. This notation is also known as tukasiewiczian notation (after the Polish logician Jan Lukasiewicz). Polish notation offers certain’computational advantages over the traditional infix notation. By infix notation we mean that each operator is positioned between its operands, as in the expression a + b. Initially, an inductive definition of the set of valid infix expressions is given. The evaluation of infix and suffix expressions is also briefly introduced.

Let us now consider the set of all valid, completely parenthesized arithmetic expressions consisting of single-letter variables, nonnegative integers, and the four operators +, —, *, and /. The set of all such valid expressions is contained in the following recursive definition:

1. All single-letter variables and nonnegative integers are valid infix expressions.

2. If a and /? are valid infix expressions, then so aretmp9-431 and (a//?).

3. The only valid infix expressions are those defined by steps 1 and 2.

Note that in the writing of a valid expression (according to this definition), complete parenthesization must be used. In order to reduce the severity of this restriction, a number of conventions have been developed. Perhaps the most obvious of these is one which permits the dropping of the outermost parentheses. An expression such as ,tmp9-432is considered to be valid, as well as

tmp9-433the form required by the preceding definition.

Another method of reducing the number of parentheses is to assign an order of precedence to the connectives or operators. Once this assignment is done, further reductions can be made by requiring that for any two operators of equal precedence which appear in a formula, the left one is evaluated first. Such operators are said to be left-associative. Such a convention is commonly used in arithmetic; for example, 2 – 6/3 + 8 stands for (2 – (6/3)) + 8.

Let us first consider the evaluation of unparenthesized arithmetic expressions in which the precedences of the operators * and / are considered to be equal and of higher value than those of + and —. An example of such an expression is

tmp9-437

According to the preceding convention, this expression stands fortmp9-438tmp9-439. In evaluating this expression, we must scan it repeatedly from left to right. The numbers associated with the subexpressions indicate the steps of the evaluation. Clearly, this evaluation process is not optimal, since repeated scanning of the expression must be performed.

If an expression contains parentheses, the order of evaluation may be altered by these parentheses. For example, in evaluating the expressiontmp9-440 we first compute a + b and c + d and then perform the indicated multiplication. By the suitable use of parentheses, it is possible to write expressions which make the order of evaluating the subexpressions independent of operator precedence. This independence can be realized by parenthesizing subexpressions in such a manner that, for each operator, there is a pair of parentheses. This pair encloses the operator and its associated operands. The parenthetical level of an operator is defined to be the total number of pairs of parentheses that surround it. A pair of parentheses has the same parenthetical level as that of the operator to which it corresponds, that is, of the operator which is immediately enclosed by this pair. Such an expression is known as a fully parenthesized expression. For example, in the fully parenthesized expression

tmp9-444

the integers below the operators specify the parenthetical level of each operator. During the evaluation of a fully parenthesized expression, the subexpression which contains the operator with the highest parenthetical level is evaluated first. If two or more operators have the same parentheticai level, they are evaluated from left to right. Once the subexpressions containing operators at the highest parenthetical level have been evaluated, the subexpressions containing the operators at the next highest level are evaluated in the same manner. This process is continued until all the operations have been performed.

Regardless of the extent of parenthesization, repeated scanning of an infix expression in a left-to-right manner is required in order to evaluate it. Therepeated scanning of an infix expression can be avoided if it is first converted to an equivalent parenthesis-free suffix or prefix expression in which the subexpressions have the form

tmp9-445

The prefix and suffix forms of the notation are also called prefix Polish and reverse Polish, respectively. Examples of expressions in their prefix, infix, and suffix forms are given in Table 7-1.

Note that in the conversion of ah infix expression to its Polish equivalent, the variables in each form are all in the same relative positions. Furthermore, both Polish forms are parenthesis-free, and the operators in these forms are rearranged according to the rules of operator precedence.

A fully parenthesized infix expression is easily translated to suffix notation by initially converting the innermost parenthesized subexpression and then proceeding toward the outside of the expression. For example, in the expression

tmp9-446

the innermost parenthesized subexpression of level 3 is

tmp9-447

and it is converted to yz + . This suffix expression becomes the first operand of the operator — at level 2. Consequently, the subexpression yz + – 5 of level 2 is converted to the suffix equivalent yz + 5 — , and finally at level 1, the term x * yz + 5 — is transformed to the final suffix form xyz + 5 – *.

Certain FORTRAN compilers initially convert partially parenthesized expressions to a fully parenthesized form before conversion to a suffix or prefix form is performed.

Table 7-1 Equivalent expressions

Suffix

Prefix

(reverse

(prefix

Infix

Polish)

Polish)

tmp9-448

Suffix form is associated with a bottom-up parse while the prefix form is linked to a top-down parse. More will be said about this point in the next subsection. In the remainder of this discussion we concern ourselves with suffix Polish notation. An analogous discussion holds for prefix Polish notation.

Let us now turn to the problem of evaluating a suffix expression. For example, to evaluate the suffix expressiontmp9-449in infix), we scan this string from left to right until we encounter the operator + . The two operands w and * which immediately precede this operator are its operands, and the subexpression wx + is replaced by its value. Let us assume that this value is denoted by a. This reduces the original suffix string to ayz + *. Continuing the scanning beyond a, the next operator encountered is +, whose operands are y and z, and the evaluation results in a value which we call fi. The intermediate string now becomes a/? *. Continuing the scanning beyond yS, the next operator is *. The operands of this operator are a and /?. Performing this multiplication, we obtain the result y. Note that only a single left-to-right scan was required in our evaluation.

This method of evaluating suffix expressions can be summarized by the following four rules, which are repeatedly applied until all operators have been processed:

1. Find the leftmost operator in the expression.

2. Select the two operands immediately preceding the operator found in step 1.

3. Perform the indicated operation.

4. Replace the operator and its operands with the result.

In this subsection, we have examined the basic notions of Polish notation. We have not, however, indicated how one gets from infix to Polish. We turn to this problem in the next subsection.

Conversion of Infix Expressions to Polish Notation

Initially, we develop an informal algorithm for translating unparenthesized infix expressions to suffix Polish. This informal algorithm is then modified to handle parenthesized expressions. This generalized algorithm is given in our algorithmic notation.

The conversion of an infix expression without parentheses to a suffix Polish expression is a straightforward matter. This conversion is based on the precedence of the operators and requires the use of a stack. The Polish expression is stored in some output string. Recall from the previous subsection that the variables and constants of the expression retain their order throughout the conversion process. The operators, however, are reordered in the output string, depending on their relative precedence, and it is for this reason that a stack is needed.

Consider the precedence values associated with the four arithmetic operators shown in Table 7-2. The precedence values are specified by the precedence function /. Also included in the table is a precedence value for single-letter variables. The special symbol # has the smallest precedence value in the table.

Table 7-2

Symbol

Precedence /

tmp9-451

1

tmp9-452

2

Variables and

nonnegative integers

3

#

0

Initially, the stack is set to the special symbol #. This symbol ensures that the stack is always nonempty. The main portion of the informal algorithm deals with the precedence-value comparison of the current input symbol and the top element of the stack. If the precedence value of this input symbol is greater than that of the symbol on top of the stack, then the current input symbol is pushed onto the stack and the next input symbol is scanned. If, however, the precedence value of the current input symbol is less than or equal to that of the stack top symbol, then we write this top element in the output string, after which we compare the precedence values of the same current input symbol and the new element on the stack. This process continues until all input symbols have been processed. A trace of this informal algorithm for the infix expression a — b * c + d/5 is given in Table 7-3.

Note that, because of their high precedence, a variable or a constant is always placed immediately on the stack. When the very next input symbol is encountered, the variable or constant on the stack is written in the output string. Our informal algorithm can be changed so that the precedence value of the current input symbol is tested for a value of 3. If this test is successful, the current input symbol is either a variable or a constant and it can be placed in the output string without first being placed on the stack.

Table 7-3 Translation oftmp9-453into suffix Polish

Contents of stack

Current input

(rightmost symbol is

Reverse-Polish

symbol

top of stack)

expression

tmp9-454 tmp9-455 tmp9-456
tmp9-457 tmp9-458 tmp9-459
tmp9-460 tmp9-461 tmp9-462
tmp9-463 tmp9-464 tmp9-465
tmp9-466 tmp9-467 tmp9-468
tmp9-469 tmp9-470 tmp9-471
tmp9-472 tmp9-473 tmp9-474
tmp9-475 tmp9-476 tmp9-477
tmp9-478 tmp9-479 tmp9-480
tmp9-481 tmp9-482 tmp9-483
tmp9-484 tmp9-485 tmp9-486

This is not done, however, for reasons of generality and uniformity which become important as the infix expression becomes more complex than the ones we are now considering.

Also note that a current input symbol with a precedence value greater than that of the stack top will result in the input symbol being placed on the stack. This is natural, since this symbol has priority over other symbols in the stack. This phenomenon reflects the fact that the last operator to be placed on the stack is the first to be written into the output string. Furthermore, for left-associative operators, when the precedence of the current input symbol is equal to that of the symbol in the stack, the leftmost symbol is to be written out first. Therefore, our algorithm will convert a – b — c to ab – c — and not to abc — —. The suffix Polish expression ab – c – corresponds to the infix expression (a — b) — c, while abc — — is equivalent to a – (b – c).

Let us now turn our attention to the conversion of incompletely parenthesized expressions to suffix Polish. When programmers write an expression containing parentheses, they do not normally write it in a completely parenthesized form. Intuitively, when a left parenthesis is encountered in the infix expression, it should be placed on the stack regardless of its present contents. However, when it is in the stack, it should be removed and discarded only when a right parenthesis is encountered in the infix expression, at which time the right parenthesis is also ignored. A left parenthesis can be forced on the stack by assigning to it a precedence value greater than that of any other operator. Once on the stack, the left parenthesis should have another precedence value (called its stack precedence) which is smaller than that of any other operator. We can get rid of the left parenthesis on the stack by checking for an incoming right parenthesis in the infix expression. The right parenthesis is never inserted on the stack. Actually, we can modify the previous informal algorithm in such a manner that the left and right parentheses can perform the same function as the special symbol ‘#’ used earlier. The original table of precedence values (Table 7-2) can be revised to have both an input- and stack-precedence value for each operator and operand. This revision gets rid of the special symbol ‘#’. The algorithm also becomes more general, since other operators, such as the relational, logical, unary, and ternary operators, can be added without making the algorithm significantly more complex. Table 7-4 is a revised table which includes parentheses. Each symbol has both input-symbol and stack-symbol precedences, except for a right parenthesis which does not possess a stack precedence since it is never placed on the stack. Table 7-4 also contains the exponentiation operator f ■ ‘All arithmetic operators except exponentiation have an input precedence which is lower in value than their stack precedence. This preserves the left to right processing of operators of equal precedence in an expression. The exponentiation operator in mathematics is right-associative. The expressiontmp9-487is equivalent to the parenthesized expressiontmp9-488and not to the expressiontmp9-489

The conversion of an infix expression into reverse Polish operates in much the same way as the previous informal algorithm. A left parenthesis is initially placed on the stack and a right parenthesis is concatenated onto the end of the infix expression. The algorithm is as follows.

Table 7-4

Symbol

Input-precedence function /

Stack-precedence function g

+ , -

1

2

*,/

3

4

T

6

5

Variables and constants

7

8

(

9

0

)

0

Algorithm REVERSE_POLISH. Given an input string INFIX containing an infix expression which has been padded on the right with ‘)’ and whose symbols have precedence values given by the functions f and g in Table 7-4, a vector S, used as a stack, and a function NEXTCHAR, which when invoked returns the next character of its argument, this algorithm converts the string INFIX to reverse Polish and stores it in a string called POLISH. RIGHT_PAREN is a local variable which is set to true when NEXT is a right parenthesis. TOP is an index into the S stack and points to the top element of the stack. TEMP is a temporary local variable.

tmp9-493

Repeat through step 6 while there are unprocessed symbols in INFIX

4. [Get next input symbol and set false

tmp9-494

5. [Emit symbols of higher precedence than NEXT]

tmp9-495tmp9-496

A trace of the stack contents and the output string POLISH for the padded infix expression

tmp9-497is given in Table 7-5.

Note that in this algorithm we have not checked for stack overflow. We have assumed that there is sufficient space in the stack to hold the stacked symbols.

It is possible to extend the precedence functions to handle relational operators, conditional statements, unconditional transfers (go to), subscripted variables, and many other features found in present programming languages. Some exercises at the end of this section will deal with these extensions.

We have been concerned until now with the conversion of an infix expression to reverse Polish. The motivation behind this conversion is that reverse Polish can be converted into object code by linearly scanning the Polish string once.

The problem of converting infix expressions to prefix Polish will not be discussed in this section. A simple algorithm based on the scanning of an infix expression from right to left can be easily formulated. In many cases the entire infix string is not available, but it is obtained one symbol at a time in a left-to-right manner (because this is the way we write programs). Therefore, a practical algorithm for converting infix to prefix must be based on a left-to-right scan of the infix string. To facilitate such an algorithm, however, two stacks instead of the usual one can be used. This is left as an exercise.

Table 7-5 Translation of infix expression to suffix Polish

Current input symbol

Contents of stack

(rightmost symbol is top of stack)

Reverse-Polish expression

tmp9-499 tmp9-500 tmp9-501
tmp9-502 tmp9-503 tmp9-504
tmp9-505 tmp9-506 tmp9-507
tmp9-508 tmp9-509 tmp9-510
tmp9-511 tmp9-512 tmp9-513
tmp9-514 tmp9-515 tmp9-516
tmp9-517 tmp9-518 tmp9-519
tmp9-520 tmp9-521 tmp9-522
tmp9-523 tmp9-524 tmp9-525
tmp9-526 tmp9-527 tmp9-528
tmp9-529 tmp9-530 tmp9-531
tmp9-532 tmp9-533 tmp9-534

In our discussion of the infix-to-suffix conversion process we have ignored the problems of detecting syntactic errors in the infix expressions being converted. We now turn to a discussion of this topic.

Error Detection for Infix Expressions

The preceding subsection describes an algorithm which converts valid infix arithmetic expressions into suffix Polish expressions. This section considers modifications to this algorithm which enables it to detect invalid infix expressions when they are presented to it.

Since we are attempting to detect deviations from the definition of valid infix expressions, we now present a definition similar to that given in Sec. 7-1.1.

1. All single-letter variables and nonnegative integers are valid infix expressions.

2. If a and /? are valid infix expressions, then so aretmp9-535 and (a).

3. The only valid infix expressions are those defined by steps 1 and 2.

Note that the definition has been modified to allow incompletely parenthesized infix expressions.

Unparenthesized infix expressions are basically sequences of operators tmp9-536and operands (variables and integers) in which no two operands are adjacent and no two operators are adjacent. Parentheses are inserted into this sequence of symbols subject to the following restrictions:

1. For each left parenthesis in the infix expression there is a right parenthesis appearing later in the expression.

2. The left parenthesis ‘(‘ is always followed by an operand or another ‘(‘.

3. The right parenthesis ‘)’ is always preceded by an operand or another ‘)’.

The first restriction can be checked by monitoring the stack level in the infix-to-suffix conversion process. The precedence relationships are such that when a right parenthesis is encountered in the left-to-right scan of the infix expression, all symbols down to and including the first left parenthesis are popped from the stack. Recall that a left parenthesis is initially placed on the stack. If a right parenthesis appears in the infix expression before a matching left parenthesis has appeared, the resulting unstacking operation will completely empty the stack. Unless we have reached the end of the infix expression, this constitutes a syntactic error in the infix expression. In this way an error is detected if there are too many right parentheses in an expression. If there are too few right parentheses in an expression, the final right parenthesis concatenated on the end of the infix expression will fail to empty the stack. Thus we must check that the stack is empty if and only if we have reached the end of the infix expression.

Table 7-6 Valid character pairs for infix expressions

h

Second symbol

a

+ -

■ */

(

)

a

0

1

1

0

1

First

+ -

1

0

0

1

0

symbol

*/

1

0

0

1

0

(

1

0

0

1

0

)

0

1

1

0

1

In order to enforce the remaining restrictions on parentheses and the requirement that operands and operators must alternate, we must find a way to reject certain pairs of characters which cannot appear in valid infix expressions. We do this by constructing a character-pair matrix h, as shown in Table 7-6. This matrix contains a row and a column for each class of symbols appearing in infix expressions. If the element of the matrix corresponding to a given pair of symbols is a 1, the pair of symbols can appear in an infix expression. If the matrix entry is 0, the pair of symbols cannot appear in an infix expression. The entries in the matrix are chosen in such a way as to prevent adjacent operands and adjacent operators, as well as to make sure that ‘(‘ is followed by an operand or ‘(‘ and that ‘)1 is preceded by an operand or 1)’.

We now give the infix-to-sufiix conversion algorithm, modified so as to detect syntactic errors in the infix expression being converted to Polish form. The algorithm incorporates the stack checking and character-pair checking described in the preceding paragraphs.

Algorithm IMPROVED_REVERSE_POLISH. Given an input string INFIX containing an infix expression which has been padded on the right with ‘)’, this algorithm computes the suffix Polish translation of the expression and stores it in the string POLISH. If INFIX does not contain a well-formed infix expression, the algorithm prints ‘INVALID EXPRESSION’ and halts. The values of the precedence functions f and g are given in Table 7-4. The values of the character-pair matrix h are given in Table 7-6. The function NEXTCHAR returns the next character of its argument. The vector S, along with the index TOP, is used as a stack. It is assumed that S is large enough to hold the required number of symbols. LAST and NEXT contain symbols of the infix expression. TEMP and RIGHT_PAREN are both local variables.

tmp9-539

2. [Scan the infix expression]

Repeat through step 5 while there are unprocessed symbols in INFIX

3. [Get next input symbol, check character pair, set right-parenthesis flag]

tmp9-540

4. [Emit symbols of higher precedence than NEXT]

tmp9-541

5. [Push NEXT onto stack?!

tmp9-542

6. [Finished. Check that the stack is empty]

tmp9-543

7. [Bad expression]

tmp9-544

Algorithm IMPROVED_REVERSE_POLISH will translate valid infix expressions into valid suffix expressions. Furthermore, it will never accept an invalid infix expression as input. Thus the user may have greater confidence in this improved version of the algorithm than in the original.

In Sec. 7-2 we consider a parsing method which is a logical extension of the infix-to-suffix conversion algorithm. This parsing method can be applied to a class of input languages far more general than infix arithmetic expressions.

Next post:

Previous post: