A token's type explains the token's membership in the terminal alphabet.
All instances of a given terminal have the same token type.
A token's semantic value provides additional information about the
token.
For terminals such as plus, no semantic information is required, because only
one token (+) can correspond to that terminal. Other terminals, such as id
and num, require semantic information so that the compiler can record which
identifier or number has been scanned.
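The two components of a token can be sketched as a small record. This is only an illustration: the class name Token and its field names are assumptions, and the terminal names plus, id, and num are taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    # Hypothetical record type; the field names are illustrative assumptions.
    type: str                    # membership in the terminal alphabet
    value: Optional[str] = None  # semantic value; None when the type suffices


# plus needs no semantic value: only "+" can correspond to it.
plus = Token("plus")

# id and num carry the scanned text so the compiler knows which one was seen.
ident = Token("id", "x")
number = Token("num", "42")
```

Two tokens of the same terminal share a type but may differ in value, so Token("num", "42") and Token("num", "7") are distinct tokens of type num.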
The scanner in Figure 2.5 finds the beginning of a token by first skipping
over any blanks. Scanners are often instructed to ignore comments and symbols
that serve only to format the text, such as blanks and tabs. Next, using
a single character of lookahead (the peek method), the scanner determines if
the next token will be a num or some other terminal. Because the code for
scanning a number is relatively complex, it is relegated to the ScanDigits
procedure shown in Figure 2.6. Otherwise, the scanner is moved to the next input
character (using advance), which suffices to determine the next token.
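Figures 2.5 and 2.6 are not reproduced here, so the following is a minimal sketch of the steps just described rather than the book's code. The names peek, advance, and ScanDigits (written scan_digits) come from the text; CharStream, the operator table, the single-letter identifiers, and the eof token are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    type: str
    value: Optional[str] = None


class CharStream:
    """Hypothetical input wrapper offering one character of lookahead."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0

    def peek(self) -> str:
        # Return the next character without consuming it ("" at end of input).
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def advance(self) -> str:
        # Consume and return the next character.
        ch = self.peek()
        if ch:
            self.pos += 1
        return ch


BLANKS = (" ", "\t", "\n")
OPERATORS = {"+": "plus", "-": "minus", "=": "assign"}  # assumed terminals


def is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def scan_digits(s: CharStream) -> Token:
    # Digits, optionally followed by "." and more digits.
    text = ""
    while is_digit(s.peek()):
        text += s.advance()
    if s.peek() != ".":
        return Token("inum", text)
    text += s.advance()  # consume the "."
    if not is_digit(s.peek()):
        raise ValueError("expected a digit after '.'")
    while is_digit(s.peek()):
        text += s.advance()
    return Token("fnum", text)


def scan(s: CharStream) -> Token:
    while s.peek() in BLANKS:  # skip formatting characters
        s.advance()
    ch = s.peek()              # one character of lookahead
    if ch == "":
        return Token("eof")
    if is_digit(ch):
        return scan_digits(s)
    s.advance()                # one character decides every other token
    if ch in OPERATORS:
        return Token(OPERATORS[ch])
    if "a" <= ch <= "z":       # assumption: identifiers are single letters
        return Token("id", ch)
    raise ValueError(f"unexpected character {ch!r}")


def tokenize(text: str) -> List[Token]:
    s, tokens = CharStream(text), []
    while True:
        tok = scan(s)
        if tok.type == "eof":
            return tokens
        tokens.append(tok)
```

For example, tokenize("a = 5 + 3.2") yields an id, an assign, an inum, a plus, and an fnum, with the blanks discarded along the way.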
For most programming languages, the scanner's job is not so easy. Some
tokens (+) can be prefixes of other tokens (++); other tokens such as comments
and string constants have special symbols involved in their recognition. For
example, a string constant is usually surrounded by quote symbols. If such
symbols are meant to appear literally in the string constant, then they are
usually escaped by a special character such as backslash (\). Variable-length
tokens such as identifiers, constants, and comments must be matched character
by character. If the next character is part of the current token, it is consumed.
When a character that cannot be part of the current token is reached, scanning
is complete. Some input files may contain character sequences that do not
correspond to any token and should be flagged as errors.
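These complications can be sketched with three small helpers. The function names and the (value, next position) return convention are assumptions chosen for illustration, and the escape handling is deliberately simplified: the character after a backslash is taken literally, with no translation of sequences such as \n.

```python
from typing import Tuple


def scan_plus(text: str, pos: int) -> Tuple[str, int]:
    # "+" is a prefix of "++", so look one character further before deciding.
    if text.startswith("++", pos):
        return "++", pos + 2
    return "+", pos + 1


def scan_identifier(text: str, pos: int) -> Tuple[str, int]:
    # Consume characters while they can still be part of the identifier.
    i = pos
    while i < len(text) and (text[i].isalnum() or text[i] == "_"):
        i += 1
    return text[pos:i], i


def scan_string(text: str, pos: int) -> Tuple[str, int]:
    # Scan a double-quoted string constant whose opening quote is at text[pos].
    if text[pos] != '"':
        raise ValueError("string constant must start with a quote")
    chars = []
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            chars.append(text[i + 1])  # escaped character, taken literally
            i += 2
        elif ch == '"':
            return "".join(chars), i + 1  # closing quote ends the token
        else:
            chars.append(ch)
            i += 1
    # No closing quote: the input does not correspond to any token.
    raise ValueError("unterminated string constant")
```

Each helper stops at the first character that cannot belong to the current token and reports where scanning should resume; scan_string raises an error when the input ends before the closing quote.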
The inum- and fnum-finding code in Figure 2.6 is written ad hoc, yet the
logic of its construction is patterned after the tokens' regular expressions. A
recurring theme in compiler construction is the use of such principled
approaches and patterns to guide the crafting of a compiler's phases.
While the code in Figures 2.5 and 2.6 serves to illustrate the nature of a
scanner, we emphasize that the most reliable and expedient methods for
constructing scanners do so automatically from regular expressions, as covered in
Chapter 3. Such scanners are reasonably efficient and correct by construction,
given a correct set of regular-expression specifications for the tokens.
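To give a flavor of specification-driven scanning, here is a minimal sketch that drives a scanner from a table of regular expressions using Python's re module. It is not the generator approach of Chapter 3, which compiles the specification into an automaton; the patterns shown for inum, fnum, id, and plus are assumed for illustration and are not quoted from the book.

```python
import re
from typing import List, Tuple

# Assumed token specification: (terminal name, regular expression).
# Python tries alternatives in order rather than picking the longest match,
# so fnum must be listed before inum.
TOKEN_SPEC = [
    ("fnum", r"[0-9]+\.[0-9]+"),
    ("inum", r"[0-9]+"),
    ("id", r"[a-z]"),
    ("plus", r"\+"),
    ("blank", r"[ \t]+"),
]

# One master pattern with a named group per terminal.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text: str) -> List[Tuple[str, str]]:
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            # No specification matches here: flag the input as an error.
            raise ValueError(
                f"unexpected character {text[pos]!r} at position {pos}"
            )
        if m.lastgroup != "blank":  # blanks only format the text
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Changing the token set means editing only TOKEN_SPEC; the scanning loop stays the same, which is the sense in which a scanner built this way is only as correct as its specifications.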
2.5 Parsing
The parser is responsible for determining if the stream of tokens provided
by the scanner conforms to the language's grammar specification.
In most
 
 