Scanning—Theory and Practice - Crafting a Compiler

Java Reference

In-Depth Information

because scanners must wade through the text of a program character by char-

acter.

Suppose we want to implement a very fast compiler that can compile a

program in a few seconds. We will use 30,000 lines per minute (500 lines per

second) as our goal. (Compilers such as Turbo C

achieve such speeds.) If an

average line contains 20 characters, the compiler must scan 10,000 characters

per second. On a processor that executes 10,000,000 instructions per second,

even if we did nothing but scanning, we would have only 1,000 instructions

per input character to spend. But because scanning is not the only thing

a compiler does, 250 instructions per character is more realistic. This is a

rather tight budget, considering that even a simple assignment takes several

instructions on a typical processor. Although faster processors are common

these days and 30,000 lines per minute is an ambitious speed, clearly a poorly

coded scanner can dramatically impact a compiler's performance.

++

3.2 Regular Expressions

Regular expressions are a convenient way to specify various simple (although

possibly infinite) sets of strings. They are of practical interest because they

can specify the structure of the tokens used in a programming language. In

particular, you can use regular expressions to program a scanner generator.

Regular expressions are widely used in computer applications other than

compilers. The Unix R

utility grep uses them to define search patterns in files.

Unix shells allow a restricted form of regular expressions when specifying file

lists for a command. Most editors provide a “context search” command that

enables you to specify desired matches using regular expressions.

A set of strings defined by regular expressions is called a regular set .For

purposes of scanning, a token class is a regular set whose structure is defined

by a regular expression. A particular instance of a token class is sometimes

called a lexeme ; however, we simply call a string in a token class an instance

of that token. For example, we call the stringabcan identifier if it matches the

regular expression that defines the set of valid identifier tokens.

Our definition of regular expressions starts with a finite character set, or

vocabulary (denoted

). This vocabulary is normally the character set used

by a computer. Today, the ASCII character set, which contains 128 characters,

is very widely used. Java, however, uses the Unicode character set. This set

includes all of the ASCII characters aswell as awide variety of other characters.

An empty, or null, string is allowed (denoted

Σ

λ

). This symbol represents an

empty bu

er in which no characters have yet been matched. It also represents

an optional part of a token. Thus, an integer literal may begin with a plus or

minus, or, if it is unsigned, it may begin with

ff

λ

.

Crafting a Compiler

Search WWH ::

Custom Search

Home