Java Reference
In-Depth Information
In describing our lexical tokens, we usually separate them into categories. In our example
program, public , class , static , and void are categorized as reserved words. Factorial ,
main , String , args , System , out , and println are all identifiers; an identifier is a token in
its own right. The string != is a literal, a string literal in this instance. The rest are operators
and separators; notice that we distinguish between single-character operators such as + and
multi-character operators like >= . In describing the tokens, we can usually simply list the
different reserved words and operators, and describe the structure of identifiers and literals.
2.2 Scanning Tokens
We call the program that breaks the input stream of characters into tokens a lexical analyzer
or, less formally, a scanner.
A scanner may be hand-crafted, that is, a program written by the compiler writer; or
it may be generated automatically from a specification consisting of a sequence of regular
expressions. The lexical analyzer that we describe in this section will be hand-crafted. We
look at generated scanners later.
When we look at the problem of producing a sequence of tokens from a program like
that above, we must determine where each token begins and ends. Clearly, white space
(blanks, tabs, new lines, etc.) plays a role; in the example above it is used to separate the
reserved word public from class , and class from the identifier Factorial . But not all
tokens are necessarily separated by white space. Whether or not we have come to the end
of a token in scanning an input stream of characters depends on what sort of token we are
currently scanning. For example, in the context of scanning an identifier, if we come across
a letter, that letter is clearly part of the identifier. On the other hand, if we are in the
process of scanning an integer value (consisting of the digits 0 to 9), then that same letter
would indicate that we have come to the end of our integer token. For this reason, we find
it useful to describe our scanner using a state transition diagram.
Identifiers and Integer Literals
For example, consider the state transition diagram in Figure 2.1. It may be used for recog-
nizing j-- identifiers and decimal integers.
In a state transition diagram, the nodes represent states, and directed edges represent
moves from one state to another depending on what character has been scanned. If a
character scanned does not label any of the edges, then the unlabeled edge is taken (and
the input is not advanced). Think of it as a machine that recognizes (scans) identifiers and
integer literals.
Consider the case when the next token in the input is the identifier x1 . Beginning in the
start state, the machine considers the first character, x ; because it is a letter, the machine
scans it and goes into the id state (that is, a state in which we are recognizing an identifier).
Seeing the next 1 , it scans that digit and goes back into the id state. When the machine
comes across a character that is neither a letter nor a digit, nor an underscore, nor a dollar
sign, it takes the unmarked edge and goes into the idEnd state (a final state indicated by
a double circle) without scanning anything.
If the first character were the digit 0 (zero), the machine would scan the zero and go
directly into the (final) intEnd state. On the other hand, if the first character were a non-
zero digit, it would scan it and go into the integer state. From there it would scan succeeding
 
Search WWH ::




Custom Search