Java Reference
In-Depth Information
because scanners must wade through the text of a program character by char-
acter.
Suppose we want to implement a very fast compiler that can compile a
program in a few seconds. We will use 30,000 lines per minute (500 lines per
second) as our goal. (Compilers such as Turbo C
achieve such speeds.) If an
average line contains 20 characters, the compiler must scan 10,000 characters
per second. On a processor that executes 10,000,000 instructions per second,
even if we did nothing but scanning, we would have only 1,000 instructions
per input character to spend. But because scanning is not the only thing
a compiler does, 250 instructions per character is more realistic. This is a
rather tight budget, considering that even a simple assignment takes several
instructions on a typical processor. Although faster processors are common
these days and 30,000 lines per minute is an ambitious speed, clearly a poorly
coded scanner can dramatically impact a compiler's performance.
++
3.2 Regular Expressions
Regular expressions are a convenient way to specify various simple (although
possibly infinite) sets of strings. They are of practical interest because they
can specify the structure of the tokens used in a programming language. In
particular, you can use regular expressions to program a scanner generator.
Regular expressions are widely used in computer applications other than
compilers. The Unix R
utility grep uses them to define search patterns in files.
Unix shells allow a restricted form of regular expressions when specifying file
lists for a command. Most editors provide a “context search” command that
enables you to specify desired matches using regular expressions.
A set of strings defined by regular expressions is called a regular set .For
purposes of scanning, a token class is a regular set whose structure is defined
by a regular expression. A particular instance of a token class is sometimes
called a lexeme ; however, we simply call a string in a token class an instance
of that token. For example, we call the stringabcan identifier if it matches the
regular expression that defines the set of valid identifier tokens.
Our definition of regular expressions starts with a finite character set, or
vocabulary (denoted
). This vocabulary is normally the character set used
by a computer. Today, the ASCII character set, which contains 128 characters,
is very widely used. Java, however, uses the Unicode character set. This set
includes all of the ASCII characters aswell as awide variety of other characters.
An empty, or null, string is allowed (denoted
Σ
λ
). This symbol represents an
empty bu
er in which no characters have yet been matched. It also represents
an optional part of a token. Thus, an integer literal may begin with a plus or
minus, or, if it is unsigned, it may begin with
ff
λ
.
 
 
Search WWH ::




Custom Search