Figure 3.16: An example of double buffering.
length, then we can extend the buffer size, perhaps by using Java-style Vector objects rather than arrays to implement buffers.
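A minimal sketch of that idea, assuming a hypothetical GrowableTokenBuffer class (this is not the book's code): the token buffer doubles its capacity on demand, so an unusually long token never overflows a fixed-size array.

import java.util.Arrays;

class GrowableTokenBuffer {
    private char[] buf = new char[64]; // initial capacity; grows on demand
    private int length = 0;

    // Append a scanned character, doubling the buffer when it fills,
    // much as a java.util.Vector grows itself.
    void append(char c) {
        if (length == buf.length) {
            buf = Arrays.copyOf(buf, buf.length * 2);
        }
        buf[length++] = c;
    }

    // Return the accumulated token text and reset for the next token.
    String takeText() {
        String text = new String(buf, 0, length);
        length = 0;
        return text;
    }
}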
We can speed up a scanner not only by doing block reads, but also by avoiding unnecessary copying of characters. Because so many characters are scanned, moving them from one place to another can be costly. A block read enables direct reading into the scanning buffer rather than into an intermediate input buffer. As characters are scanned, we need not copy characters from the input buffer unless we recognize a token whose text must be saved or processed (an identifier or a literal). With care, we can process the token's text directly from the input buffer.
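As a rough illustration (the class and method names here are hypothetical, not the book's code), the sketch below reads a block directly into the scanning buffer and builds an identifier's text straight from that buffer, copying characters only once the token is recognized.

import java.io.IOException;
import java.io.Reader;

class BlockScanner {
    private final Reader input;
    private final char[] buffer = new char[4096]; // the scanning buffer itself
    private int limit = 0; // number of valid characters currently in buffer
    private int pos = 0;   // index of the next character to scan

    BlockScanner(Reader input) {
        this.input = input;
    }

    // Refill the scanning buffer with a single block read; no intermediate
    // input buffer is involved.
    boolean fill() throws IOException {
        limit = input.read(buffer, 0, buffer.length);
        pos = 0;
        return limit > 0;
    }

    // Scan an identifier whose first character is already in the buffer at
    // pos. Characters are examined in place; the only copy happens when the
    // token is recognized and its text must be saved. For simplicity this
    // sketch assumes the whole token lies within the current block; the
    // double-buffering scheme of Figure 3.16 removes that restriction.
    String scanIdentifier() {
        int tokenStart = pos;
        while (pos < limit && Character.isLetterOrDigit(buffer[pos])) {
            pos++;
        }
        return new String(buffer, tokenStart, pos - tokenStart);
    }
}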
At some point, using a profiling tool such as qpt, prof, gprof, or pixie may allow you to find unexpected performance bottlenecks in a scanner.
3.7.6 Lexical Error Recovery
A character sequence that cannot be scanned into any valid token results in a lexical error. Although uncommon, such errors must be handled by a scanner. It is unreasonable to stop compilation because of what is often a minor error, so usually we try some sort of lexical error recovery. Two approaches come to mind:
1. Delete the characters read so far and restart scanning at the next unread
character.
2. Delete the first character read by the scanner and resume scanning at the
character following it.
Both approaches are reasonable. The former can be done by resetting the scanner and beginning scanning anew. The latter is a bit harder to do but also is a bit safer (because fewer characters are immediately deleted). Non-deleted characters can be rescanned using the buffering mechanism described previously for scanner backup.
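A minimal sketch of the second approach, assuming a hypothetical RecoveringScanner whose buffer already supports the backup just described: only the offending first character is deleted, and scanning resumes at the character that followed it.

class RecoveringScanner {
    private final char[] buffer;  // characters available for (re)scanning
    private int tokenStart = 0;   // where the current token attempt began
    private int pos = 0;          // next character to scan

    RecoveringScanner(String input) {
        this.buffer = input.toCharArray();
    }

    // Invoked when the characters from tokenStart up to pos cannot form any
    // valid token. Only the first character is deleted; the scanner backs up
    // so that the remaining characters are rescanned rather than discarded.
    void recoverFromLexicalError() {
        System.err.println("lexical error: illegal character '"
                + buffer[tokenStart] + "'");
        pos = tokenStart + 1;
        tokenStart = pos;
    }
}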
In most cases, a lexical error is caused by the appearance of some illegal character, which usually appears as the beginning of a token. In this case, the two approaches work equally well. The effects of lexical error recovery might well create a syntax error, which will be detected and handled by the parser.
Consider ...for$tnight.... The $ would terminate scanning of for. Since no valid token begins with $, it would be deleted. Then tnight would be scanned as an identifier.