Top-Down Parsing (Compiler Writing) Part 2

Recursive-Descent Parsers

In this section the brute-force parsing strategy just discussed is restricted so that backup is not allowed. This revised method of parsing, though less general, performs better than the brute-force method. This section also describes an approach to error detection and recovery for this modified parsing technique.

Recursive-descent parsing algorithm. The top-down parsing method given in the previous section is very general but can be very time-consuming. A more efficient (though less general) method that does not allow backup can be devised. This method of parsing is known as recursive descent. It should be noted, however, that this highly recursive technique does not work on all context-free grammars. That is, certain grammars require backup in order for successful parsing to occur.

In the recursive-descent method of parsing, a sequence of production applications is realized in a sequence of function calls. In particular, functions are written for each nonterminal. Each function returns a value of true or false depending on whether or not it recognizes a substring which is an expansion of that nonterminal. The programming mechanism for the handling of recursive function calls provides the stacking capability required, thus freeing the user from having to establish and manipulate a stack explicitly, assuming that the language used supports recursion.

The remainder of this subsection presents an algorithm for the recursive-descent parsing method for the following context-free grammar:


⟨expr⟩ ::= ⟨term⟩ + ⟨expr⟩ | ⟨term⟩
⟨term⟩ ::= ⟨factor⟩ * ⟨term⟩ | ⟨factor⟩
⟨factor⟩ ::= (⟨expr⟩) | i

Note that this grammar contains right-recursive rules. In this way we have avoided the left-recursion problem. Also note that the alternates of ⟨term⟩ and ⟨expr⟩ have been ordered in such a way as to check for the longest alternate first. This approach is important when one alternate is a proper head of another alternate, and such is the case for the alternates of both ⟨term⟩ and ⟨expr⟩.

A recursive-descent parser for this example grammar contains one recursive function for each nonterminal in the grammar (i.e., ⟨factor⟩, ⟨term⟩, and ⟨expr⟩) which parses phrases for that nonterminal. Each function specifies where to begin a search for a phrase of its associated nonterminal. The function looks for a phrase by comparing the input string beginning at a specified point with the alternates of its associated nonterminal and invoking other functions to recognize the subgoals when required.

The following algorithm and its associated functions parse without any backup the strings of the language described by the example grammar. This is accomplished easily by using one context symbol following the part of the phrase which has already been parsed. The context symbol in question can be either a '+' or a '*'. Note that there are four functions: GET_CHAR, EXPR, FACTOR, and TERM. Also note that each function has no parameters or local variables. These observations are characteristic of recursive-descent parsers.

Algorithm RECDSNT. This algorithm calls four programmer-defined functions: GET_CHAR, EXPR, FACTOR, and TERM. The first one, GET_CHAR, always returns the next character in the input string, INPUT, and assigns it to the global variable NEXT. CHAR is local to GET_CHAR and contains the value to be returned. The latter three functions are all recursive and return either Boolean true or false. The variable CURSOR is also global to all routines and denotes the present character position in the input string. The variable i is a local index variable. SUB and LENGTH are built-in string-manipulation functions which return a specified substring of a string and the number of characters in a string, respectively.

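The listing of Algorithm RECDSNT itself is not reproduced here, but a rough C sketch of such a parser for the example grammar might look as follows. The names get_char, expr, term, factor, next, and cursor mirror those used in the description above; the '#' end-of-input marker and the surrounding main program are conveniences of this sketch rather than part of the original algorithm.

#include <stdio.h>
#include <stdbool.h>

static const char *input;      /* INPUT: the string being parsed          */
static int cursor = 0;         /* CURSOR: current position in the input   */
static char next;              /* NEXT: the one-symbol lookahead          */

/* GET_CHAR: place the next input symbol in next ('#' marks end of input). */
static void get_char(void)
{
    next = input[cursor] ? input[cursor++] : '#';
}

static bool expr(void);
static bool term(void);

/* <factor> ::= ( <expr> ) | i */
static bool factor(void)
{
    if (next == 'i') {
        get_char();
        return true;
    }
    if (next == '(') {
        get_char();
        if (!expr() || next != ')')
            return false;
        get_char();
        return true;
    }
    return false;
}

/* <term> ::= <factor> * <term> | <factor>
   The context symbol '*' decides, without backup, which alternate applies. */
static bool term(void)
{
    if (!factor())
        return false;
    if (next == '*') {
        get_char();
        return term();
    }
    return true;
}

/* <expr> ::= <term> + <expr> | <term> */
static bool expr(void)
{
    if (!term())
        return false;
    if (next == '+') {
        get_char();
        return expr();
    }
    return true;
}

int main(void)
{
    input = "(i+i)*i";
    get_char();                /* prime the lookahead, as in RECDSNT */
    puts(expr() && next == '#' ? "valid" : "invalid");
    return 0;
}

Note how each function examines only the single context symbol held in next; this is what allows the parse to proceed without backup.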

The following points are made concerning the algorithm:

1. The variable NEXT is global and contains the next symbol of the input string which is being processed. When a function to find a new goal is called, the first symbol to be examined is already in NEXT. Similarly, before returning from a function after a successful match, the symbol following the substring found by the function is put into NEXT.

2. The function GET_CHAR obtains the next input symbol.

3. To begin parsing, the main algorithm invokes the function GET_CHAR, which in turn places the leftmost symbol in the input string in NEXT.

A trace of the parse for the input

(i + i) * i

is given in Fig. 6-4. Note that h denotes the position of the current input symbol.

A final point concerning the parsing algorithm just given is the returning of true and false by the recursive functions. These truth values need not be returned. In fact, an explicit call to another routine can be made instead of returning false.

Figure 6-4 Trace of Algorithm RECDSNT for the string (i + i) * i

The part of the algorithm that returns true can be eliminated. In other words, the recursive functions can be replaced by recursive procedures.

In the previous approach, right recursion was used to avoid the problem of left recursion. Another approach is to use an iterative form of the grammar.

⟨expr⟩ ::= ⟨term⟩ {+ ⟨term⟩}
⟨term⟩ ::= ⟨factor⟩ {* ⟨factor⟩}
⟨factor⟩ ::= (⟨expr⟩) | i

In this case, the constructs {* ⟨factor⟩} and {+ ⟨term⟩} are implemented by using iteration in the functions for ⟨term⟩ and ⟨expr⟩, respectively. A revised algorithm based on this grammar is left as an exercise.
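
As a hint toward that exercise, the following sketch, which reuses next, get_char, and factor from the C sketch given earlier, shows how the two iterative constructs might be coded; only term and expr change.

/* <term> ::= <factor> { * <factor> } : iterate instead of recursing on <term> */
static bool term(void)
{
    if (!factor())
        return false;
    while (next == '*') {
        get_char();
        if (!factor())
            return false;
    }
    return true;
}

/* <expr> ::= <term> { + <term> } */
static bool expr(void)
{
    if (!term())
        return false;
    while (next == '+') {
        get_char();
        if (!term())
            return false;
    }
    return true;
}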

Thus far we have ignored error detection and error recovery in recursive-descent parsers. We next modify the parser just developed to handle syntax errors.

Error detection in recursive-descent parsers. The recursive-descent parser presented in the previous subsection has little or no practical value. Any nontrivial grammar parsed in this manner will halt on the first error. A chain of "panic" travels up the stack, and at best, a diagnostic might point to the offending character. Alternatively, a single error might put the parser so out of step with the input symbols that it would generate a barrage of near-meaningless error messages. However, a standard method of adding error detection and recovery to recursive-descent parsers has evolved and has been described in detail by Wirth (1976). The language being parsed must have a simple structure and reserved keywords for such a parser to perform sensibly. In general, such parsers either skip or insert symbols upon detecting an error until a reasonable analysis of the source text may be resumed. The following method provides a set of ground rules, which might be modified heuristically.

First, recursive procedures replace recursive functions. As before, each procedure P corresponds to some nonterminal or syntactic class A. However, once called, a procedure does not return any value to the calling procedure but rather insists on finding an instance of a syntactic class.

To illustrate, here is the function EXPR rewritten as a procedure.

Procedure EXPR (listing not reproduced here)
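
A rough C rendering of such a procedure, reusing next and get_char from the earlier sketch, might look as follows; term_proc stands for the analogous, assumed procedure for ⟨term⟩.

static void term_proc(void);     /* assumed: the procedure form of TERM */

/* EXPR as a procedure: it returns no value but simply insists on finding an
   instance of <expr>, leaving error handling to the scheme described next. */
static void expr_proc(void)
{
    term_proc();                 /* parse a <term> */
    if (next == '+') {           /* longer alternate: <term> + <expr> */
        get_char();
        expr_proc();
    }
}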

Next, we must construct the sets of starters, followers, and stoppers for every syntactic class A. The set of starters of A is defined as the set of terminals which may legally start an instance of A. The set of followers of A is defined as the set of terminals which may follow an instance of A. The set of stoppers of A is a set of keywords of the grammar being parsed which may not necessarily follow an instance of A but should not be ignored under any circumstances. The set of stoppers may not be universal to every syntactic class.

Some data structure is necessary to represent these sets. Languages such as PASCAL which provide built-in set constructs greatly simplify this task. At this point, this problem is left to the reader.
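
For concreteness, here is one possible representation in C, where PASCAL's set type is imitated by a bit mask over small integer symbol codes; the symbol names are invented for illustration.

#include <stdbool.h>

/* Invented symbol codes for the tokens of the example language. */
typedef enum {
    SYM_BEGIN, SYM_END, SYM_IF, SYM_THEN, SYM_IDENT, SYM_ASSIGN,
    SYM_SEMICOLON, SYM_PERIOD, SYM_PROCEDURE, SYM_EOF
} symbol;

typedef unsigned long symset;          /* one bit per symbol code */

#define BIT(x) (1UL << (x))            /* a singleton set         */

/* SYM_IN(SYMBOL, SET): true if SYMBOL is a member of SET. */
static bool sym_in(symbol s, symset set)
{
    return (set >> s) & 1UL;
}

/* For example, the starters of <statement>: "begin", "if", and an identifier. */
static const symset statement_starters = BIT(SYM_BEGIN) | BIT(SYM_IF) | BIT(SYM_IDENT);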

Consider the following grammar for a simple block-nested language:

(The grammar is not reproduced here.)

To simplify the discussion, we will leave the definition of ⟨cond⟩ up to the reader. We also add the rules:

(The additional rules are not reproduced here.)

This constraint artificially makes 'p' and 'i' keywords, which is not usually the case. However, it does not affect the generality of the discussion. When a statement in such a grammar is parsed, a statement not beginning with a keyword is typically assumed to be an assignment statement. This grammar will generate simple programs of the form:

(The sample program is not reproduced here.)

For the syntactic class ⟨statement⟩, the set of starters consists of the symbols "begin," "if," and "i." The set of followers consists solely of the semicolon symbol. Constructing the set of stoppers is not as simple, but it may be reasonably assumed that it should at least include "procedure," "begin," "end," "if," the semicolon, and the period. For the rest of this discussion, the set of followers will denote the union of the sets of followers and stoppers.

A parsing procedure for some nonterminal A is constructed as follows: On entry to the procedure P, the current symbol is tested to ensure that it may legally start an instance of A. If so, the parsing procedure proceeds as before. But suppose the symbol is illegal. A first impulse might be to report an error and scan symbols until a legitimate starter of A is found. However, consider what would happen while parsing a ⟨block⟩ where the word "begin" was spelled incorrectly. If the block contained no blocks nested within, the parser working on the syntactic class ⟨block⟩ would scan and therefore ignore all the source inside the block.

A better idea would be to report an error and scan symbols until either a starter or a follower of A is found. Any symbols which have been ignored should be flagged. If the current symbol is a follower, report the appropriate error and return control to the calling program as if an instance of A has been successfully parsed.

Otherwise, procedure P expects to find an instance of A and begins parsing, which may of course include other recursive procedure calls. Upon completion of parsing an instance of A, it is necessary to check that the current symbol is a legitimate follower of A. If not, an error message is printed and input symbols are scanned until a proper follower is found. Control is then returned to the calling program.

In this manner, we create the following very general parsing algorithm for nearly any syntactic class.

Procedure CLASS(STARTERS, FOLLOWERS). This algorithm adds error detection and recovery to the parsing procedures of the type described thus far. CLASS could be any syntactic class—⟨expr⟩, ⟨statement⟩, ⟨block⟩, etc. The function SYM_IN(SYMBOL, SET) performs a search of the set SET and returns the value true if SYMBOL is a member. Otherwise, SYM_IN returns the value false. The procedure SKIP_TO(SET) scans input symbols until a symbol is found which is a member of the set SET. STARTERS is the set of starters of the syntactic class CLASS. FOLLOWERS is the set of followers. STARTERS_OR_FOLLOWERS is the union of the set of starters and the set of followers of CLASS. NEXT is a string which contains the current symbol. We assume the existence of a function GET_SYM which gets the next symbol from the input stream. The dotted lines represent the skeleton for the parsing procedure as it was before error detection and recovery were added.

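The body of algorithm CLASS is not reproduced above; the following C sketch carries out the recovery steps just described, reusing symbol, symset, and sym_in from the bit-mask sketch given earlier. The stand-in token stream, the get_sym scanner, and the wording of the error messages are assumptions of this sketch.

#include <stdio.h>

static symbol next_sym;                         /* NEXT: the current symbol */

/* A stand-in scanner: GET_SYM takes its symbols from a fixed token stream. */
static const symbol token_stream[] =
    { SYM_BEGIN, SYM_IDENT, SYM_SEMICOLON, SYM_END, SYM_PERIOD, SYM_EOF };
static int token_pos = 0;

static void get_sym(void)
{
    next_sym = token_stream[token_pos++];
}

/* SKIP_TO(SET): scan input symbols until one in SET is found; SYM_EOF should
   be a member of every set passed in so that the loop always terminates.    */
static void skip_to(symset set)
{
    while (!sym_in(next_sym, set))
        get_sym();
}

/* The CLASS(STARTERS, FOLLOWERS) skeleton with error detection and recovery. */
static void parse_class(symset starters, symset followers)
{
    if (!sym_in(next_sym, starters)) {          /* symbol cannot start this class */
        printf("error: unexpected symbol %d\n", (int) next_sym);
        skip_to(starters | followers);          /* STARTERS_OR_FOLLOWERS          */
    }
    if (sym_in(next_sym, starters)) {
        /* ...... the bare-bones parsing procedure for the class goes here ...... */

        if (!sym_in(next_sym, followers)) {     /* the phrase must end on a follower */
            printf("error: unexpected symbol %d after phrase\n", (int) next_sym);
            skip_to(followers);
        }
    }
    /* otherwise the current symbol is already a follower: return as though
       an instance of the class had been parsed, so that the caller continues */
}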

For example, the three lines of Procedure EXPR presented earlier might replace the dotted lines in the example. As such, this approach clearly entails a lot of redundant coding. At least two methods exist for eliminating the redundant coding. Welsh and McKeag (1980) make use of the PASCAL construct known as an envelope. In a recent paper, Topor (1982) describes a parser which can be implemented in languages such as C, PL/I, and PASCAL which allow the passing of procedures as parameters. If PROC is the name of the parsing procedure passed to CLASS, then the dotted lines in algorithm CLASS are replaced by a call to a bare-bones parsing procedure PROC, which may make recursive calls to CLASS, etc.
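
In C, which allows a function to be passed through a pointer, the same effect can be approximated as follows; this sketch merely repeats the skeleton above with the dotted lines replaced by a call through the parameter proc, and the names are again invented.

/* PROC: the bare-bones parsing procedure for the class, passed as a parameter.
   It may itself make recursive calls back into class_parse for its subgoals. */
static void class_parse(void (*proc)(void), symset starters, symset followers)
{
    if (!sym_in(next_sym, starters)) {
        printf("error: unexpected symbol %d\n", (int) next_sym);
        skip_to(starters | followers);
    }
    if (sym_in(next_sym, starters)) {
        proc();                                  /* replaces the dotted lines */
        if (!sym_in(next_sym, followers)) {
            printf("error: unexpected symbol %d after phrase\n", (int) next_sym);
            skip_to(followers);
        }
    }
}

A parser for ⟨statement⟩ would then be invoked as, say, class_parse(statement_body, statement_starters, statement_followers), where statement_body contains only the grammar-directed code for ⟨statement⟩.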

Still more generalization is possible. The two "Write" statements might be replaced by a procedure ERROR which might be passed several parameters including the value of NEXT, a pointer to a diagnostic, a pointer to the symbol expected, etc., so that more intelligent error reporting might be done.

A strict implementation of this method yields useful results, though certain problems remain. Consider the following trivial program:

(The program, which contains a missing semicolon, is not reproduced here.)

Assuming a rigorous implementation of this method, the missing semicolon will generate numerous error messages. (Why?) In many instances, it is useful to make a distinction between the stoppers and the followers of a syntactic class to better handle omissions of symbols. Again consider the incorrect assignment statement:

(The incorrect assignment statement is not reproduced here.)

A difficult problem arises. An identifier "i" has been "misspelled." The assignment operator has also been improperly entered. Both are typical errors encountered using a parser based on this grammar. In a single statement, we encounter two instances where a strict implementation of this parser would lead to numerous error messages. The decision we must make upon encountering each error is whether to persist in parsing the current syntactic class or to give up and go on to the next class.

It is trivial to find exceptions to any hard-and-fast rule implemented in the parsing code. The point to remember is that programmers do not make their mistakes algorithmically. It is typical to forget an end statement or a semicolon but not so typical to omit an assignment operator. The parser does not share such prejudices. In practice, many test programs are necessary to exercise a parser over as wide a range of human behavior as possible, to ensure that detection is complete, reporting is accurate, and recovery is full.

In general a good parser should not collapse on any input sequence. It should be able to detect and mark all incorrect constructs and in particular should recover well from errors that are made frequently.

We have to this point examined parsing strategies which involve no backup or full backup. We next examine a parsing strategy which involves using partial or limited backup.

Top-Down Parsing with Limited Backup

This section examines a representation of a top-down parsing technique which is more general than the recursive-descent method of Sec. 6-1.2 and less powerful than the brute-force method of Sec. 6-1.1, which had full backup capabilities. In other words, the method which is presented here has a backup capability, but there are some grammars for which the brute-force method succeeds but the method of this subsection fails. An example will be used to show this later.

Knuth (1971) provides a very concise representation of a top-down parsing technique. This representation takes the form of a set of assembler-type statements, each having an opcode and two address fields, AT and AF. Let h be a global index into the input string. There are two kinds of opcodes:

1. Terminal symbol; e.g., a
Action: if the symbol at position h of the input string is a, then set h to h + 1 and go to AT; otherwise go to AF.

2. Address of procedure; e.g., [A], a nonterminal symbol in brackets
Action: call the procedure in location A. (The procedure returns true or false as a value.) If true, then go to AT; otherwise go to AF.

There are three kinds of address fields:

1. Nonterminal symbol; e.g., A
Meaning: go to the procedure in location A (i.e., a "go to," not a "call").

2. Blank
Meaning: proceed to the next instruction.

3. True or false
Meaning: true: return to the calling procedure with the value true. false: reset h to the value it had on entry to this procedure and return to the calling procedure with the value false.

In addition, OK is the address in the main program to which control passes when a parse is successful; ERROR is the address to which transfer is made on an unsuccessful parse, either to attempt recovery or simply to emit a message. For example, given the grammar whose rules are:

S → E#
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | b

with S being the goal symbol and b, +, *, (, ), and # the terminal symbols, the parsing program could be developed by applying the previous definitions of opcodes and address fields. Such a program is presented in the following discussion. Note that the symbol # is a special marker symbol which appears only at the end of the given input string.

Empty productions can be handled readily. If we have X → ε as one of the productions, the following subprogram achieves the desired results:

Location    Op-code    AT       AF
X           [N]        True     True
N           a          False    False

where X and N are locations and a is any terminal. The effect is that we automatically return true from X but without advancing h, precisely the actions we desire.
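
To make the machinery concrete, here is a hedged C sketch of an interpreter for this instruction format, loaded with just the two-instruction subprogram above. The struct layout, the sentinel values used to encode the blank, true, and false address fields, and the use of a 0-based h are choices of this sketch, not part of the original formulation.

#include <stdio.h>
#include <stdbool.h>

/* Address-field encoding: a non-negative value is the index of another
   instruction (a "go to"); the sentinels cover blank, true, and false. */
enum { A_NEXT = -1, A_TRUE = -2, A_FALSE = -3 };

typedef struct {
    bool call;     /* true: op-code is [A], a procedure call; false: a terminal */
    char sym;      /* terminal symbol to match when call is false               */
    int  target;   /* index of the called procedure when call is true           */
    int  at, af;   /* the AT and AF address fields                              */
} instruction;

static const char *input;   /* the input string                                  */
static int h;               /* global index into the input (0-based in this sketch) */

/* The subprogram for X -> empty: X calls N, and N always returns false with h
   unchanged, so X always exits through its AF field with the value true.      */
enum { LOC_X = 0, LOC_N = 1 };
static const instruction code[] = {
    /* X */ { true,  0,   LOC_N, A_TRUE,  A_TRUE  },
    /* N */ { false, 'a', 0,     A_FALSE, A_FALSE },
};

/* Execute the procedure whose first instruction is at index pc. */
static bool run(int pc)
{
    int saved_h = h;                    /* needed for the "reset h" action of false */
    for (;;) {
        instruction in = code[pc];
        bool ok;
        if (in.call)
            ok = run(in.target);        /* [A]: call the procedure at location A    */
        else {
            ok = (input[h] == in.sym);  /* terminal: compare with the hth symbol    */
            if (ok) h++;
        }
        int dest = ok ? in.at : in.af;
        if (dest == A_NEXT)
            pc++;                       /* blank: proceed to the next instruction   */
        else if (dest == A_TRUE)
            return true;                /* return true to the caller                */
        else if (dest == A_FALSE) {
            h = saved_h;                /* reset h and return false                 */
            return false;
        } else
            pc = dest;                  /* a location: a "go to," not a "call"      */
    }
}

int main(void)
{
    input = "abc";
    h = 0;
    bool result = run(LOC_X);
    /* Expected output: "X returned true, h = 0": X succeeds without advancing h,
       even though the 'a' was momentarily matched (and then reset) inside N.     */
    printf("X returned %s, h = %d\n", result ? "true" : "false", h);
    return 0;
}

The same run routine, loaded with the complete program of Fig. 6-5 (with OK and ERROR mapped onto the true and false sentinels), can be used to parse strings such as (b * b)#.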

There is a standard form into which the rules can be put; the general form itself is not reproduced here. A rule which is not in this standard form can always be rewritten so that we have a set of rules that are in this form.

For our example grammar, after adding the new nonterminals E" and T" (which derive +TE' and *FT', respectively) in order to convert each rule to standard form, Fig. 6-5 can serve as its parsing program.

Let us trace through the action of this program in terms of its calling and checking sequences as it parses the string (b * b)#. Assume an initial call to the procedure associated with the goal symbol of the grammar, and an initial value of 1 for h. Such a trace is given in Fig. 6-6.

Location    Op-code    AT       AF
S           [E]                 ERROR
            #          OK       ERROR
E           [T]                 False
            [E']       True     False
E'          [E"]       True
            [N]        True     True
E"          +                   False
            [T]                 False
            [E']       True     False
T           [F]                 False
            [T']       True     False
T'          [T"]       True
            [N]        True     True
T"          *                   False
            [F]                 False
            [T']       True     False
F           b          True
            (                   False
            [E]                 False
            )          True     False
N           b          False    False

Figure 6-5 Parser representation of a sample grammar.

Figure 6-6 Parse trace of the string (b * b)#

The problem with this algorithm is that we cannot reconstruct the parse. The true returns are actually the steps in the parse, but after parsing is complete, we no longer know this return sequence. A solution, of course, is to emit code whenever we return true from a procedure. Alternatively, a list can be constructed of the return sequence.

The syntax tree for this parse is given in Fig. 6-7. By examining it, we see that the order in which productions are applied is precisely the order in which the true returns occur in the execution of the parsing program. This order can be shown by listing the nonterminal left-hand sides of the productions applied. This order is FFT'T'T'TE'EFT'TE'ES.

In the design of parsing algorithms based on rules similar to those of grammar G, or on rules cast into the specified standard form, a certain amount of care must be taken. For example, if we have the rule X → a | ab, the string ab will never be recognized as an X; the value true will always be returned once the a has been encountered. Having the rules written in this standard form makes several solutions to this problem possible. One solution is to order the alternatives properly so that the longest is first. In this case, the rule would be rewritten as X → ab | a. Another method is to factor out the common element a. The rule could then be written as X → a[b], and a program to handle this rule would be (after a bit of optimizing):

Location    Op-code    AT       AF
X           a                   False
            b          True     True

This approach has the advantage of saving one backup step.

Figure 6-7 Syntax tree for the parse of (b * b).

Location    Op-code    AT       AF
S           [S']                ERROR
            #          OK       ERROR
S'          [A]                 False
            [B]        True     False
A           [A']       True
            a          True     False
A'          a                   False
            a          True     False
B           b          True
            a                   False
            c          True     False

Figure 6-8

As mentioned at the beginning of this subsection, this method does not succeed for all grammars. Consider the grammar

S → AB#
A → aa | a
B → b | ac

Putting this grammar in standard form yields the following grammar:

S → S'#
S' → AB
A → A' | a
A' → aa
B → b | ac

A parser program for this grammar is given in Fig. 6-8. For the valid input string aac#, the parser program fails in its attempt to construct a parse. We leave it as an exercise for the reader to verify this fact. The brute-force method of Sec. 6-1.1, however, succeeds in parsing the input string aac#. We also leave the verification of this point as an exercise.
