Lexical Analysis - Introduction to Compiler Construction in a Java World

Java Reference

In-Depth Information

In describing our lexical tokens, we usually separate them into categories. In our example

program, public , class , static , and void are categorized as reserved words. Factorial ,

main , String , args , System , out , and println are all identifiers; an identifier is a token in

its own right. The string != is a literal, a string literal in this instance. The rest are operators

and separators; notice that we distinguish between single-character operators such as + and

multi-character operators like >= . In describing the tokens, we can usually simply list the

different reserved words and operators, and describe the structure of identifiers and literals.

2.2 Scanning Tokens

We call the program that breaks the input stream of characters into tokens a lexical analyzer

or, less formally, a scanner.

A scanner may be hand-crafted, that is, a program written by the compiler writer; or

it may be generated automatically from a specification consisting of a sequence of regular

expressions. The lexical analyzer that we describe in this section will be hand-crafted. We

look at generated scanners later.

When we look at the problem of producing a sequence of tokens from a program like

that above, we must determine where each token begins and ends. Clearly, white space

(blanks, tabs, new lines, etc.) plays a role; in the example above it is used to separate the

reserved word public from class , and class from the identifier Factorial . But not all

tokens are necessarily separated by white space. Whether or not we have come to the end

of a token in scanning an input stream of characters depends on what sort of token we are

currently scanning. For example, in the context of scanning an identifier, if we come across

a letter, that letter is clearly part of the identifier. On the other hand, if we are in the

process of scanning an integer value (consisting of the digits 0 to 9), then that same letter

would indicate that we have come to the end of our integer token. For this reason, we find

it useful to describe our scanner using a state transition diagram.

Identifiers and Integer Literals

For example, consider the state transition diagram in Figure 2.1. It may be used for recog-

nizing j-- identifiers and decimal integers.

In a state transition diagram, the nodes represent states, and directed edges represent

moves from one state to another depending on what character has been scanned. If a

character scanned does not label any of the edges, then the unlabeled edge is taken (and

the input is not advanced). Think of it as a machine that recognizes (scans) identifiers and

integer literals.

Consider the case when the next token in the input is the identifier x1 . Beginning in the

start state, the machine considers the first character, x ; because it is a letter, the machine

scans it and goes into the id state (that is, a state in which we are recognizing an identifier).

Seeing the next 1 , it scans that digit and goes back into the id state. When the machine

comes across a character that is neither a letter nor a digit, nor an underscore, nor a dollar

sign, it takes the unmarked edge and goes into the idEnd state (a final state indicated by

a double circle) without scanning anything.

If the first character were the digit 0 (zero), the machine would scan the zero and go

directly into the (final) intEnd state. On the other hand, if the first character were a non-

zero digit, it would scan it and go into the integer state. From there it would scan succeeding

Introduction to Compiler Construction in a Java World

Search WWH ::

Custom Search

Home