Scanning—Theory and Practice - Crafting a Compiler

Because C has a rather elaborate macro definition and expansion facility, macro
processing and included files are typically handled by a preprocessing phase
prior to scanning and parsing. The preprocessor, cpp, may in fact be used with
languages other than C to obtain the effects of source file inclusion, macro
processing, and so on.

Some languages (such as C and PL/I) include conditional compilation
directives that control whether statements are compiled or ignored. Such
directives are useful in creating multiple versions of a program from a common
source. Usually, these directives have the general form of an if statement;
hence, a conditional expression will be evaluated. Characters following the
expression will either be scanned and passed to the parser, or ignored until
an end if delimiter is reached. If conditional compilation structures can be
nested, a skeletal parser for the directives may be needed.
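
The need for a skeletal parser can be illustrated with a minimal sketch. It assumes a hypothetical line-oriented directive syntax, `#if NAME` and `#endif`, in which a condition is true when NAME belongs to a set of defined symbols; neither the syntax nor the function name `filter_conditional` comes from any particular language or tool. A stack remembers whether the enclosing region was active, which is all the "parsing" that nesting requires here.

```python
def filter_conditional(lines, defined):
    """Return the lines that the conditional directives leave active.

    Hypothetical syntax (an assumption for illustration only):
    '#if NAME' opens a region, '#endif' closes it, one directive per line.
    """
    kept = []
    stack = []        # saved "active" state of each enclosing region
    active = True     # are we currently passing text on to the parser?
    for line in lines:
        words = line.split()
        head = words[0] if words else ""
        if head == "#if":
            if len(words) != 2:
                raise SyntaxError("expected '#if NAME'")
            stack.append(active)
            # A nested region is active only if its parent is active too.
            active = active and (words[1] in defined)
        elif head == "#endif":
            if not stack:
                raise SyntaxError("'#endif' without matching '#if'")
            active = stack.pop()
        elif active:
            kept.append(line)
    if stack:
        raise SyntaxError("missing '#endif'")
    return kept
```

For instance, with only `DEBUG` defined, a region guarded by `#if TRACE` nested inside `#if DEBUG` is skipped while the rest of the `DEBUG` region is kept. A real scanner would work on characters rather than whole lines and would evaluate a full conditional expression; both are simplified away here.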

Another function of the scanner is to list source lines and to prepare for the
possible generation of error messages. While straightforward, this requires a
bit of care. The most obvious way to produce a source listing is to echo
characters as they are read, using end-of-line characters to terminate a line,
increment line counters, and so on. However, this approach has a number of
shortcomings:

• Error messages may need to be printed. These should appear merged
  with source lines, with pointers to the offending symbol.

• A source line may need to be edited before it is written. This may involve
  inserting or deleting symbols (for example, for error repair), replacing
  symbols (because of macro preprocessing), and reformatting symbols
  (to prettyprint a program, that is, to print a program with text properly
  indented, if-else pairs aligned, and so on).

• Source lines that are read are not always in a one-to-one correspondence
  with source listing lines that are written. For example, in Unix a source
  program can legally be condensed into a single line (Unix places no limit
  on line lengths). A scanner that attempts to buffer entire source lines
  may well overflow buffer lengths.
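
The first point, merging an error message with its source line, might look like the following sketch. The function name `listing_with_error`, the four-digit line-number field, and the caret marker are all assumptions made for illustration; the column is 0-based and the line is assumed to contain no tab characters.

```python
def listing_with_error(line_no, source_line, column, message):
    """Return a numbered source line followed by a pointer line.

    The caret on the second line sits under the offending symbol,
    which starts at 0-based `column` of `source_line` (no tabs assumed).
    """
    prefix = f"{line_no:4d}  "                      # line number, then two spaces
    pointer = " " * (len(prefix) + column) + "^"    # align caret under the symbol
    return f"{prefix}{source_line}\n{pointer} {message}"
```

Because the pointer is computed from the column of the symbol, this only works if the listing line still matches what was scanned, which is exactly what simple echoing cannot guarantee once lines are edited.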

In light of these considerations, it is best to build output lines (which
normally are bounded by device limits) incrementally as tokens are scanned.
The token image placed in the output buffer may not be an exact image of
the token that was scanned, depending on error repair, prettyprinting, case
conversion, or whatever else is required. If a token cannot fit on an output
line, then the line is written and the buffer is cleared. (To simplify editing, you
should place source line numbers in the program's listing.) In rare cases, a
token may need to be broken; for example, if a string is so long that its text
exceeds the output line length.
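
The incremental approach described above can be sketched as follows. The class name `ListingBuffer`, the default width of 72, and the single space placed between token images are assumptions chosen for the example; completed lines are collected in a list that stands in for the real output device, and source line numbers are omitted for brevity.

```python
class ListingBuffer:
    """Builds bounded output lines one token image at a time."""

    def __init__(self, width=72):
        self.width = width    # assumed device limit on output line length
        self.current = ""     # the output line under construction
        self.lines = []       # completed lines (stand-in for the output device)

    def flush(self):
        """Write the line under construction, if any, and clear the buffer."""
        if self.current:
            self.lines.append(self.current)
            self.current = ""

    def add(self, image):
        """Append a token image, starting a new line if it does not fit."""
        sep = " " if self.current else ""
        if len(self.current) + len(sep) + len(image) > self.width:
            self.flush()
            sep = ""
        # Rare case: the image alone exceeds the line length, so break it.
        while len(image) > self.width:
            self.lines.append(image[:self.width])
            image = image[self.width:]
        self.current += sep + image
```

The caller passes whatever image it wants listed, so an image altered by error repair, prettyprinting, or case conversion is handled the same way as an unaltered one. No output line ever exceeds `width`, regardless of how long the source lines were.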
