Java Reference
In-Depth Information
include anticipating the tokens and contexts that may complicate scanning,
avoiding performance bottlenecks, and recovering from lexical errors.
Section 3.8 describes the theory used by tools such as Lex to turn regu-
lar expressions into executable scanners. While this material is not strictly
necessary to craft a compiler, the theoretical aspects of scanning are elegant,
relatively straightforward, and helpful in understanding both the power and
limitations of scanners.
3.1 Overview of a Scanner
The primary function of a scanner is to transform a character stream into a
token stream. A scanner is sometimes called a lexical analyzer ,or lexer .The
names “scanner,” “lexical analyzer,” and “lexer” are used interchangeably.
The ac scanner discussed in Chapter 2 was simple and could be coded by any
competent programmer. In this chapter, we develop a thorough and system-
atic approach to scanning that will allow us to create scanners for complete
programming languages.
We introduce formal notations for specifying the precise structure of to-
kens. At first glance, this may seem unnecessary because of the simple token
structure found in most programming languages. However, token structure
can be more detailed and subtle than one might expect. For example, consider
string constants in C, C
,andJava TM , which are surrounded by double
quotes. The contents of the string can be any sequence of characters except the
double quote, as that would terminate the string. A double quote can appear
literally in the string only if it is preceded ( escaped )byabackslash. Isthis
simple definition really correct? Can a newline character appear in a string?
In C it cannot, unless it is escaped with a backslash. This notation avoids a
“runaway string” that, lacking a closing quote, matches characters intended
to be part of other tokens. While C, C
++
, and Java allow escaped newlines in
strings, Pascal forbids them. Ada goes further still and forbids all unprintable
characters (precisely because they are normally unreadable). Similarly, are
null (zero-length) strings allowed? C, C
++
, Java, and Ada allow them, but
Pascal forbids them. In Pascal, a string is a packed array of characters and
zero-length arrays are disallowed.
A precise definition of tokens is necessary to ensure that lexical rules are
clearly stated and properly enforced. Formal definitions also allow a language
designer to anticipate design flaws. For example, virtually all languages pro-
vide syntax for specifying certain kinds of rational constants . Such constants
are often specified using decimal numerals such as 0.1 and 10.01.Should
the notation .1 or 10. also be allowed? In C, C
++
,andJava,suchnotation
is permitted, but in Pascal and Ada, it is not—and for an interesting reason.
Scanners normally seek to match as many characters as possible. Thus ABC is
++
 
 
Search WWH ::




Custom Search