Top-Down Parsing (Compiler Writing) Part 1

In Sec. 2-4.2 the notions of top-down parsing and bottom-up parsing were introduced. Recall that top-down parsing was characterized as a parsing method which, beginning with the goal symbol of the grammar, attempted to produce a string of terminal symbols that was identical to a given source string. This matching process proceeded by successively applying the productions of the grammar to produce substrings from nonterminals. In this topic we will examine several methods of performing top-down parsing.

The types of grammars that we will be dealing with fall into the class of context-free grammars (see Sec. 2-4). Empty productions will be permitted, with the symbol e being used throughout to represent the empty string.

This topic is divided into two major sections. In Sec. 6-1 we will deal with general methods of top-down parsing. In this section brief mention will also be made of methods using partial or limited backup. Section 6-2 will discuss top-down parsing methods without backup.

GENERAL TOP-DOWN PARSING STRATEGIES

The nature of the top-down parsing technique is characterized by presenting three methods of performing such parses. The first method, called a brute-force method, is presented both informally and formally and is accompanied by a parsing algorithm. The second method, known as recursive descent, is a parsing technique which does not allow backup. The third method is a parsing technique which involves top-down parsing with limited or partial backup. This type of parsing is introduced in the form of a program written in assemblerlike statements. A comparison of this technique with full backup parsing is made. Error-detection and -recovery techniques are discussed briefly.


Brute-Force Approach

A top-down parse moves from the goal symbol to a string of terminal symbols. In the terminology of trees, this is moving from the root of the tree to a set of the leaves in the syntax tree for a program. In using full backup we are willing to attempt to create a syntax tree by following branches until the correct set of terminals is reached. In the worst possible case, that of trying to parse a string which is not in the language, all possible combinations are attempted before the failure to parse is recognized.

Top-down parsing with full backup is a " brute-force" method of parsing. In general terms, this method operates as follows:

1. Given a particular nonterminal that is to be expanded, the first production for this nonterminal is applied.

2. Then, within this newly expanded string, the next (leftmost) nonterminal is selected for expansion and its first production is applied.

3. This process (step 2) of applying productions is repeated for all subsequent nonterminals that are selected until such time as the process cannot or should not be continued. This termination (if it ever occurs) may be due to two causes. First, no more nonterminals may be present, in which case the string has been successfully parsed. Second, it may result from an incorrect expansion which would be indicated by the production of a substring of terminals which does not match the appropriate segment of the source string. In the case of such an incorrect expansion, the process is "backed up" by undoing the most recently applied production. Instead of using the particular expansion that caused the error, the next production of this nonterminal is used as the next expansion, and then the process of production application continues as before.

If, on the other hand, no further productions are available to replace the production that caused the error, this error-causing expansion is replaced by the nonterminal itself, and the process is backed up again to undo the next most recently applied production. This backing up continues either until we are able to resume normal application of productions to selected nonterminals or until we have backed up to the goal symbol and there are no further productions to be tried. In the latter case, the given string must be unparsable because it is not part of the language determined by this particular grammar.

As an example of this brute-force parsing technique, let us consider the simple grammar

tmp9-21_thumb

where S is the goal or start symbol. Figure 6-1 illustrates the working of this brute-force parsing technique by showing the sequence of syntax trees generated during the parse of the string ‘accd’.

Initially we start with the tree of Fig. 6-1 a, which merely contains the goal symbol. We next select the first production for S, thus yielding Fig. 6-1 b. At this point we have matched the symbol a in the string to be parsed. We now choose the first production for A and obtain Fig. 6-lc. Note, however, that we have a mismatch between the second symbol c of the input string and the second symbol b in the sentential form abd. At this point in the parse, we must back up. The previous production application for A must be deleted and replaced with its next choice. The result of performing this operation is to transform Fig. 6-lc into Fig. 6-1 d with the leftmost two characters of the given input string being matched. When the third symbol c of the input string is compared with the last symbol d of the current sentential form, however, a mismatch again occurs. The previously chosen production for A must be deleted. Since there are no more rules for A which can be selected, we must also delete the production for 5 and select the next production for S. This sequence of operations yields Fig. 6-le. The final step involves applying the first rule for B to Fig. 6-le. This application yields Fig. 6-1/.

Trace of a brute-force top-down parse for string ' accd1. (Note: The symbol | denotes the extent of the scanning process from left to right in the input string.)

Figure 6-1 Trace of a brute-force top-down parse for string ‘ accd1. (Note: The symbol | denotes the extent of the scanning process from left to right in the input string.)

The remaining input symbols are then matched with the remaining symbols in the sentential form of Fig. 6-1 /, thereby resulting in a successful parse.

The top-down parsing method just described will not work on every context-free grammar. We now examine a particular subclass of the context-free grammars which causes this method to fail.

Given a context-free grammar itmp9-23_thumb, a nonterminal X is said to be left-recursive iftmp9-24_thumbfor some stringtmp9-25_thumb. If a grammar G contains at least one left-recursive nonterminal, G is said to be left-recursive.

A grammar which has one or more nonterminals that are left-recursive can be presented with strings to parse that will cause our informal top-down parsing method to enter an infinite loop of production applications. As an example of this phenomenon, consider the grammar which generates the languagetmp9-30_thumbFigure 6-2 illustrates a trace of this situation for the attempted parse of the string ‘abc’.

Of course, this particular infinite loop can be avoided by reversing the order of the productions fortmp9-31_thumbor by rewriting the first production for A

so that the rule becomestmp9-32_thumb(thus eliminating the left recursion). Not all left recursions, however, can be handled in such an easy manner.

The topic of left recursion will be discussed in more detail later in this subsection and in Sec. 6-2.4, where a method will be given that allows us to test whether or not a grammar has any nonterminals that are left-recursive.

We now present a formal characterization of this parsing method due to Aho and Ullman (1972) and an algorithm which implements it. The algorithm makes use of two stacks—one to record the current sentential form established by the series of expansions to date, and the other to record the history of the parse to date (i.e., which productions have been applied and at what point in the parse they were applied).

Lettingtmp9-33_thumbbe the general form of the rules of the grammar, we specify that the alternative productions can be ordered and that A, is the index of a,— the /th alternate of A. The algorithm consists of a set of rules for manipulating the stacks and the input string in such a way that either the parse is constructed or a syntactic error is discovered.

To express these rules, we first define a configuration as an ordered 4-tuple tmp9-38_thumb where s is the state which can be one of q, normal parsing state, b, backtracking state, or t, termination state; where i is the index into the input stringtmp9-39_thumb withtmp9-40_thumbbeing #, an end marker; wheretmp9-41_thumbis a stack (top to the right) recording the current history of choices of alternates and input symbols checked; and wheretmp9-42_thumbis a stack (top to the left) recording the current sentential form, with the top element being the symbol currently being checked.

The initial configuration istmp9-43_thumb, where S is the goal symbol of the grammar. The algorithm operates by converting this initial configuration into a terminal configuration which yields no further configuration. One configuration gives rise to another according to a particular "yields" relation.

Nonterminating sequence of syntax trees generated from a left-recursive grammar in a top-down parse.

Figure 6-2 Nonterminating sequence of syntax trees generated from a left-recursive grammar in a top-down parse.

We define I-(" yields") as

tmp9-50_thumb

fit one of the following patterns: 1. Tree expansion (i.e., apply the first production of a nonterminal)

tmp9-51_thumb

being a production and being the first alternate for A. 2. Input symbol match (i.e., advance the index into the input string)

tmp9-52_thumb

3. Terminate (i.e., a successful parse has been performed)

tmp9-53_thumb

4. Input symbol mismatch (i.e., a generated substring failed to match the input)

tmp9-54_thumb

5. Backtrack on input (i.e., back up the input string index)

tmp9-55_thumb

6. Next alternate (i.e., replace with next production, if it exists, or with the nonterminal)

tmp9-56_thumb 

A (there is no parse possible for the string!)

tmp9-57_thumb 

A (there is no parse possible for the string!)

tmp9-58_thumb 

Configurations yield configurations according to steps 1 to 6 until no further configuration can be produced, at which point the algorithm halts. (We assume here that the problem of left recursion does not arise.) If, upon halting, the final configuration istmp9-59_thumb, we have produced a left parse of the input (i.e., a parse in which the leftmost nonterminal in a sentential form is the nonterminal that is rewritten). Otherwise the input is not part of the language generated by the grammar. If a left parse is produced, the ordered list of productions used in the parse istmp9-60_thumband h is defined as

tmp9-61_thumbtmp9-62_thumbtmp9-63_thumb

We now present an algorithm to parse strings according to Aho and Ullman’s (1972) top-down method with full backup. We assume that the productions of any grammar used in conjunction with this method have been pretested and all left recursion has been eliminated. To help describe the data structures used in the algorithm, the followin" grammar for partially parenthesized arithmetic expressions involving addition and multiplication will be used.

tmp9-64_thumbtmp9-65_thumb

These rules have been rewritten (from their usual form) in order to avoid left recursion. The method of eliminating direct left recursion in this example involves performing the following manipulations:

For the left-recursive nonterminal A whose rule is

tmp9-66_thumb

where /?, does not begin with A, replace this rule by the rules

tmp9-67_thumb

For example, the familiar set of rules for deriving partially parenthesized arithmetic expressions in + and * is left-recursive in both E and T in the grammar

tmp9-68_thumb

We can eliminate this left recursion by rewriting the rules as

tmp9-69_thumb

Note that in replacing the braces notation of our sample grammar from Chap. 2, we implicitly formulated the productions in a manner very similar to this method of removing direct left recursion.

We store the productions in two arrays: LHS and RHS. Array LHS has elements each of which contains three fields. These fields are NT, containing the string representing a nonterminal symbol, MAX#, containing the number of productions that have this nonterminal as a left-hand side, and FIRST, containing the index into the array of right-hand sides for the first production for this nonterminal. Array LHS can be thought of as these three arrays, NT, MAX#, and FIRST being operated in parallel.

Array RHS is a simple array containing the right-hand sides of all the productions, each right-hand side being stored as a string that fills one entry in RHS. All the productions of a particular nonterminal are stored in successive positions of RHS.

For convenience, we assume that the goal symbol of the grammar is stored as the first entry of array LHS, and that each nonterminal and terminal symbol is a string of length 1. A right side is then just the concatenation of zero or more symbols, each of length 1. In this example, we assume that E’ and T’ are symbols of length 1.

For our example grammar Gx, the LHS and RHS arrays appear as in Fig. 6-3.

The stacks required for the operation of the algorithm are implemented with an array, HIST, representing the history stack, and a string, SENT, which records the current sentential form. An element of the array HIST contains two fields—SYMB, which contains a stack symbol (either a terminal or a nonterminal symbol) and P#, which contains either the number of the alternate for the nonterminal in SYMB or the value zero if SYMB contains a terminal or the empty string.

Data-structure representation of a grammar.

Figure 6-3 Data-structure representation of a grammar.

Thus a value of zero in P# marks a stack entry as a terminal symbol, and a value of j in P# (for j > 0) and the nonterminal A in SYMB represent the stack entry Af. TJHIST points to the top of the history stack, HIST. The stack which represents the current sentential form, the string SENT, has as its top element the first (i.e., leftmost) symbol of SENT.

For example, the configuration (q, 2, ElTlFxb, T’E'#), which is generated as part of the parse of the input string b, would have its stacks represented as

tmp9-71_thumb

Note that the subscript associated with a nonterminal symbol in a configuration denotes the alternate number of that nonterminal; for example, the subscript in E1 denotes the first (and only) alternate of E.

Algorithm TD_FULL. Given arrays LHS and RHS which represent the productions of grammar G as described earlier, and stacks represented by the vectors HIST and SENT, this algorithm parses strings to determine if they are part of the language generated by grammar G. The variables NT, MAX#, FIRST, SYMB, and P# are as just described, and T_HIST is the stack-top pointer for stack HIST. The history stack HIST has an initial entry of the empty string (treated as a terminal symbol). The input string, of length n, is in the string variable T. The variable i is the index of T that points to the next symbol of input to be processed. The variable STATE records the state of the parse. It represents either a normal state, a backtracking state, or a termination state. The variables p and $ jointly contain the top element of the history stack, and variable t contains the leftmost symbol of the current sentential form. CASE is a local variable which is used to denote the current configuration of the sentential form. For convenience, we assume the availability of a function F, defined as

tmp9-72_thumb

This algorithm is, in essence, an implementation of the "yields" relation that allows one parse configuration to give way to another.

tmp9-73

 

 

 

 

 

tmp9-74

When this algorithm terminates, if the value of the variable STATE is not the string then the input cannot be parsed. If, on the other hand, the value is ‘t’, then a left parse is produced and the application of function h to the elements of stack HIST (from bottom to top) gives the numbers of the productions in the order in which they are applied in the parse.

For the example expression grammar and the input string ‘b’, a trace of the configurations generated by the algorithm in the course of the parse is

tmp9-75_thumb

By examining the general nature of the operation of the brute-force algorithm, we can see that its basic form is that of a highly recursive set of procedure calls. This view is obvious if we think of each application of a production to expand a nonterminal as an invocation of a procedure which expands the nonterminal.

There are problems associated with methods of top-down parsing with full backup. Because of the many procedure calls or the excessive amount of backtracking required in some cases, they can be exceedingly slow. In some cases, the number of possible paths through all potential syntax trees can become so large that the construction of a parse for certain sentences becomes prohibitively expensive.

Another problem with full backup methods is that error recovery can be very poor. (The reader should refer to Chap. 5 for a discussion of error-recovery techniques.) One does not really find out that the input string is syntactically erroneous until all combinations have been tried, and by that time one cannot tell the first point at which an error was detected and why there is an error because the pointer into the input has been reset.

Another difficulty is that if code has been emitted as the parse has progressed, then every time we have to backtrack, we are forced to erase this code. This requires additional bookkeeping to associate blocks of code with states of the parse so that the required erasing can be performed correctly.

An obvious solution to these difficulties is to minimize the amount of backup performed. How much backup can be eliminated depends, of course, on the grammar, but empirically it has been found that most (if not all!) of the features

deemed desirable in a programming language can be described by context-free grammars that can be parsed with no backup or limited backup. We will describe one such method in Sec. 6-2.

We reemphasize that the previous algorithms, due to backup, can be very inefficient. A recursive implementation of a top-down parsing technique which does not allow backup is given in the next subsection.

Next post:

Previous post: