Many hand-coded scanners treat reserved words as ordinary identifiers
(as far as matching tokens is concerned) and then use a separate table lookup
to detect them. Automatically generated scanners can also use this approach,
especially if transition table size is an issue. After an apparent identifier is
scanned, an exception table is consulted to see if a reserved word has been
matched. When case is significant in reserved words, the exception lookup
requires an exact match. Otherwise, the token should be translated to a
standard form (all uppercase or all lowercase) before the lookup.
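As a sketch of this approach (the class and method names below are illustrative, not taken from any particular compiler), a scanner might consult a small exception table after matching an apparent identifier, normalizing case first when the language ignores it:

```java
import java.util.Set;

// Sketch of reserved-word detection via an exception table.
public class ExceptionTable {
    // An illustrative fragment of a language's reserved words.
    private static final Set<String> RESERVED =
        Set.of("if", "else", "while", "return");

    /** Returns true if the scanned identifier is actually a reserved word.
     *  When case is not significant, the lexeme is translated to a
     *  standard (lowercase) form before the lookup. */
    public static boolean isReserved(String lexeme, boolean caseSignificant) {
        String key = caseSignificant ? lexeme : lexeme.toLowerCase();
        return RESERVED.contains(key);
    }
}
```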
There are several ways of organizing an exception table. One obvious
mechanism is a sorted list of exceptions suitable for a binary search. A hash
table also may be used. For example, the length of a token may be used as
an index into a list of exceptions of the same length. If exception lengths are
well distributed, then few comparisons will be needed to determine whether
a token is an identifier or a reserved word. Perfect hash functions are also
possible [Spr77, Cic80]. That is, each reserved word is mapped to a unique
position in the exception table and no position in the table is unused. A token is
either the reserved word selected by the hash function or an ordinary identifier.
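A length-indexed exception table might be sketched as follows (the word lists shown are an illustrative fragment, not a complete reserved-word set):

```java
import java.util.List;
import java.util.Map;

// Sketch of an exception table indexed by token length: the length
// selects a short list of reserved words, so only a few string
// comparisons are needed to decide identifier versus reserved word.
public class LengthIndexedTable {
    private static final Map<Integer, List<String>> BY_LENGTH = Map.of(
        2, List.of("if", "do"),
        4, List.of("else", "case"),
        5, List.of("while", "break"));

    public static boolean isReserved(String lexeme) {
        return BY_LENGTH.getOrDefault(lexeme.length(), List.of())
                        .contains(lexeme);
    }
}
```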
If identifiers are entered into a string space or given a unique serial number
by the scanner, then reserved words can be entered in advance. Then, when
a string that looks like an identifier is found to have a serial number or string
space position smaller than the initial position assigned to identifiers, we know
that a reserved word rather than an identifier has been scanned. In fact, with
a little care we can assign initial serial numbers so that they exactly match the
token codes used for reserved words. That is, if an identifier is found to have
a serial number s, where s is less than the number of reserved words, then s
must be the correct token code for the reserved word just scanned.
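A minimal sketch of such a string table in Java (class and method names invented for illustration): reserved words are entered first, so their serial numbers coincide with their token codes, and any serial number below the reserved-word count identifies a reserved word.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a scanner string table in which reserved words are entered
// in advance, so a small serial number is itself the token code.
public class StringTable {
    private final Map<String, Integer> serials = new HashMap<>();
    private int next = 0;
    private final int numReserved;

    public StringTable(String... reservedWords) {
        for (String w : reservedWords)
            serials.put(w, next++);   // codes 0 .. numReserved-1
        numReserved = reservedWords.length;
    }

    /** Returns the serial number, assigning a fresh one on first sight. */
    public int lookup(String lexeme) {
        return serials.computeIfAbsent(lexeme, k -> next++);
    }

    /** A serial below the reserved-word count is itself the token code. */
    public boolean isReserved(int serial) {
        return serial < numReserved;
    }
}
```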
3.7.2 Using Compiler Directives and Listing Source Lines
Compiler directives and pragmas control compiler options (for example, list-
ings, source file inclusion, conditional compilation, optimizations, and profil-
ing). They may be processed either by the scanner or by subsequent compiler
phases. If the directive is a simple flag, then it can be extracted from a token.
The command is then executed, and finally the token is deleted. More elabo-
rate directives, such as Ada pragmas, have nontrivial structure and need to be
parsed and translated like any other statement.
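The handling of a simple flag directive can be sketched as follows (the directive spellings and option are invented for illustration): the scanner recognizes the directive as a token, executes its effect immediately, and deletes the token rather than passing it on.

```java
// Sketch of a scanner consuming a simple flag directive.
public class DirectiveScanner {
    private boolean listingEnabled = false;

    /** Returns true if the lexeme was a directive and has been consumed;
     *  ordinary tokens return false and are passed on to the parser. */
    public boolean handleDirective(String lexeme) {
        switch (lexeme) {
            case "$LIST_ON":  listingEnabled = true;  return true;
            case "$LIST_OFF": listingEnabled = false; return true;
            default:          return false;
        }
    }

    public boolean listingEnabled() {
        return listingEnabled;
    }
}
```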
A scanner may have to handle source inclusion directives. These directives
cause the scanner to suspend the reading of the current file and begin the
reading and scanning of the contents of the specified file. Since an included
file may itself contain an include directive, the scanner maintains a stack of
open files. When the file at the top of the stack is completely scanned, it is
popped and scanning resumes with the file now at the top of the stack. When
the entire stack is empty, end-of-file is recognized and scanning is completed.
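The stack of open files described above can be sketched as follows (a simplified stand-in for a real scanner's input handling, which would also track file names and line numbers): an include directive pushes a new source, the top source is read until exhausted and then popped, and end-of-file is reported only when the entire stack is empty.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of nested source inclusion using a stack of open readers.
public class IncludeStack {
    private final Deque<Reader> open = new ArrayDeque<>();

    /** An include directive suspends the current file and pushes the new one. */
    public void include(Reader source) {
        open.push(source);
    }

    /** Next character, or -1 once every stacked file is exhausted. */
    public int read() {
        try {
            while (!open.isEmpty()) {
                int c = open.peek().read();
                if (c != -1) return c;
                open.pop().close();   // top file finished: resume the one below
            }
            return -1;                // entire stack empty: true end-of-file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```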