Scanning—Theory and Practice - Crafting a Compiler

Java Reference

In-Depth Information

include anticipating the tokens and contexts that may complicate scanning,

avoiding performance bottlenecks, and recovering from lexical errors.

Section 3.8 describes the theory used by tools such as Lex to turn regu-

lar expressions into executable scanners. While this material is not strictly

necessary to craft a compiler, the theoretical aspects of scanning are elegant,

relatively straightforward, and helpful in understanding both the power and

limitations of scanners.

3.1 Overview of a Scanner

The primary function of a scanner is to transform a character stream into a

token stream. A scanner is sometimes called a lexical analyzer ,or lexer .The

names “scanner,” “lexical analyzer,” and “lexer” are used interchangeably.

The ac scanner discussed in Chapter 2 was simple and could be coded by any

competent programmer. In this chapter, we develop a thorough and system-

atic approach to scanning that will allow us to create scanners for complete

programming languages.

We introduce formal notations for specifying the precise structure of to-

kens. At first glance, this may seem unnecessary because of the simple token

structure found in most programming languages. However, token structure

can be more detailed and subtle than one might expect. For example, consider

string constants in C, C

,andJava TM , which are surrounded by double

quotes. The contents of the string can be any sequence of characters except the

double quote, as that would terminate the string. A double quote can appear

literally in the string only if it is preceded ( escaped )byabackslash. Isthis

simple definition really correct? Can a newline character appear in a string?

In C it cannot, unless it is escaped with a backslash. This notation avoids a

“runaway string” that, lacking a closing quote, matches characters intended

to be part of other tokens. While C, C

++

, and Java allow escaped newlines in

strings, Pascal forbids them. Ada goes further still and forbids all unprintable

characters (precisely because they are normally unreadable). Similarly, are

null (zero-length) strings allowed? C, C

++

, Java, and Ada allow them, but

Pascal forbids them. In Pascal, a string is a packed array of characters and

zero-length arrays are disallowed.

A precise definition of tokens is necessary to ensure that lexical rules are

clearly stated and properly enforced. Formal definitions also allow a language

designer to anticipate design flaws. For example, virtually all languages pro-

vide syntax for specifying certain kinds of rational constants . Such constants

are often specified using decimal numerals such as 0.1 and 10.01.Should

the notation .1 or 10. also be allowed? In C, C

++

,andJava,suchnotation

is permitted, but in Pascal and Ada, it is not—and for an interesting reason.

Scanners normally seek to match as many characters as possible. Thus ABC is

++

Crafting a Compiler

Search WWH ::

Custom Search

Home