Java Reference
In-Depth Information
add a newaccepting state to the previous FA that corresponds to a pseudotoken
of the form ( D + .
). If this token is recognized, then we strip the trailing . from
the token text and bu
er it for later reuse. We then return the token code of an
integer literal. In fact, we are simulating the e
ff
ff
ect of a context-sensitive match
as provided by Lex's
operator.
Multiple character lookahead may also be a consideration in scanning
invalid programs. For example, inC (andmany other programming languages)
12.3e+qis an invalid token. Many C compilers simply flag the entire character
sequence as invalid (a floating-point value with an illegal exponent). If we
follow our general scanning philosophy of matching the longest valid character
sequence, the scanner could be backed up to produce four tokens. Since this
token sequence (12.3, e, +, q) is invalid, the parser will detect a syntax error
when it processes the sequence. Whether we decide to consider this a lexical
error or a syntax error (or both) is unimportant. Some phase of the compiler
must detect the error.
We can build a scanner that can perform general backup. This allows
the scanner to operate correctly no matter how token definitions overlap. As
each character is scanned, it is bu
/
ered and a flag is set indicating whether
the character sequence scanned so far is a valid token (the flag might be the
appropriate token code). If we are not in an accepting state and cannot scan
any more characters, then backup is invoked. We extract characters from the
right end of the bu
ff
er and queue them for rescanning. This process continues
until we reach a prefix of the scanned characters flagged as a valid token. This
token is returned by the scanner. If no prefix is flagged as valid, then we have
a lexical error. (Lexical errors are discussed in Section 3.7.6.)
Bu
ff
ering and backup are essential in general-purpose scanners such as
those generated by Lex. It is impossible to know in advance which regular
expression pattern will be matched. Instead, the generated scanner (using its
internal DFA) follows all patterns that are possible matches. If a particular
pattern is found to be unmatchable, then an alternative pattern that matches
a shorter input sequence may be chosen. The scanner will back up to the
longest input prefix that can be matched, saving bu
ff
ff
ered characters that will
be matched in a later call to the scanner.
As an example of scanning with backup, consider the previous example
of 12.3e+q. Figure 3.15 shows how the bu
er is built and flags are set. When
the q is scanned, backup is invoked. The longest character sequence that is a
valid token is12.3, so a floating-point literal is returned. The remaining input
e+ is requeued so that it can be rescanned later.
ff
3.7.5 Performance Considerations
Our main concern in this chapter is showing how to write correct and robust
scanners. Because scanners do so much character-level processing, they can
 
 
Search WWH ::




Custom Search