Scanning—Theory and Practice - Crafting a Compiler

Even if a source listing is not requested, each token should contain the

line number in which it appeared. The token's position in the source line

may also be useful. If an error involving the token is noted, the line number

and position marker can be used to improve the quality of error messages by

specifying where in the source file the error occurred. It is straightforward

to open the source file and then list the source line containing the error, with

the error message immediately below it. Sometimes, an error may not be

detected until long after the line containing the error has been processed. An

example of this is a goto to an undefined label. If such delayed errors are rare

(as they usually are), then a message citing a line number can be produced,

for example, “Undefined label in statement 101.” In languages that freely

allow forward references, delayed errors may be numerous. For example,

Java allows declarations of methods after they are called. In this case, a file

of error messages keyed with line numbers can be written and later merged

with the processed source lines to produce a complete source listing. Source

line numbers are also required for reporting post-scanning errors in multipass

compilers. For example, a type conversion error may arise during semantic

analysis; associating a line number with the error message greatly helps a

programmer understand and correct the error.
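
The bookkeeping described above can be sketched in Java. This is only an illustration under stated assumptions: the names SourceListing, Token, report, and merge are hypothetical and do not come from the text, and messages are held in memory rather than in the file of error messages the text mentions. Each token carries its line and column, errors are keyed by line number as they are discovered, and a final pass merges them with the source lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names are prescribed by the text.
final class SourceListing {
    /** A token that remembers where it appeared (1-based line and column). */
    static final class Token {
        final String kind;
        final String text;
        final int line;
        final int column;

        Token(String kind, String text, int line, int column) {
            this.kind = kind;
            this.text = text;
            this.line = line;
            this.column = column;
        }
    }

    // Error messages keyed by source line number (assumption: kept in memory,
    // not in a separate file).
    private final Map<Integer, List<String>> errorsByLine = new TreeMap<>();

    /** Records an error against the token's line, however late it is detected. */
    void report(Token where, String message) {
        errorsByLine.computeIfAbsent(where.line, k -> new ArrayList<>())
                    .add("*** " + message + " (column " + where.column + ")");
    }

    /** Produces a listing in which each message appears just below its source line. */
    List<String> merge(List<String> sourceLines) {
        List<String> listing = new ArrayList<>();
        for (int i = 0; i < sourceLines.size(); i++) {
            int lineNo = i + 1;
            listing.add(String.format("%4d  %s", lineNo, sourceLines.get(i)));
            listing.addAll(errorsByLine.getOrDefault(lineNo, Collections.<String>emptyList()));
        }
        return listing;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("begin", "  goto done;", "end");
        SourceListing listing = new SourceListing();
        // The undefined label is only known once the whole program has been read.
        listing.report(new Token("Identifier", "done", 2, 8), "Undefined label 'done'");
        for (String line : listing.merge(source)) {
            System.out.println(line);
        }
    }
}
```

Keying the messages by line number means the order in which errors are detected does not matter; a delayed error still lands under the line that caused it.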

A common view is that compilers should just concentrate on translation

and code generation and leave the listing and prettyprinting (but not error

messages) to other tools. This considerably simplifies the scanner.

A scanner is designed to read input characters and partition them into tokens.

When the end of the input file is reached, it is convenient to create an end-of-file

pseudocharacter.

In Java, for example, InputStream.read(), which reads a single byte,

returns -1 when end-of-file is reached. A constant, Eof, defined as -1, can

be treated as an “extended” ASCII character. This character then allows the

definition of an EndFile token that can be passed back to the parser. The

EndFile token is useful in a CFG because it allows the parser to verify that the

logical end of a program corresponds to its physical end. In fact, LL(1) parsers

(discussed in Chapter 5) and LALR(1) parsers (discussed in Chapter 6) require

an EndFile token.
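
As a minimal sketch of this idea, the following Java fragment treats the -1 returned by InputStream.read() as the Eof pseudocharacter and turns it into an EndFile token. The class EofScanner, the TokenKind enum, and the toy language of whitespace-separated words are assumptions made for illustration; only InputStream.read(), Eof, and EndFile come from the text. Input is assumed to be ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; EofScanner and TokenKind are illustrative names.
final class EofScanner {
    /** End-of-file pseudocharacter: the value InputStream.read() returns at end of input. */
    static final int Eof = -1;

    enum TokenKind { Word, EndFile }

    static final class Token {
        final TokenKind kind;
        final String text;

        Token(TokenKind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    private final InputStream in;
    private int current; // one character of lookahead, or Eof

    EofScanner(InputStream in) {
        this.in = in;
        this.current = read();
    }

    // Reads one byte; I/O failures are rethrown unchecked to keep the sketch short.
    private int read() {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next token; Eof is handled like any other lookahead character. */
    Token next() {
        while (current != Eof && Character.isWhitespace(current)) {
            current = read();
        }
        if (current == Eof) {
            // The stream is not read again once Eof has been seen.
            return new Token(TokenKind.EndFile, "");
        }
        StringBuilder text = new StringBuilder();
        while (current != Eof && !Character.isWhitespace(current)) {
            text.append((char) current);
            current = read();
        }
        return new Token(TokenKind.Word, text.toString());
    }

    public static void main(String[] args) {
        byte[] source = "goto L1".getBytes(StandardCharsets.US_ASCII);
        EofScanner scanner = new EofScanner(new ByteArrayInputStream(source));
        Token token;
        do {
            token = scanner.next();
            System.out.println(token.kind + " '" + token.text + "'");
        } while (token.kind != TokenKind.EndFile);
    }
}
```

Because Eof is an int outside the range 0 to 255 that read() returns for real bytes, it can never be confused with a character of the input.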

What will happen if a scanner is called after end-of-file is reached?

Obviously, a fatal error could be registered, but this would destroy our simple

model in which the scanner always returns a token. A better approach is to

continue to return the EndFile token to the parser. This allows the parser to

handle termination cleanly, especially since the EndFile token is normally

syntactically valid only after a complete program is parsed. If the EndFile token
