Scanning—Theory and Practice - Crafting a Compiler

Figure 3.6: The operation of the Lex scanner generator. (Diagram: a Lex specification is given to Lex, which produces a scanner module in C.)

programmed. Rather, we can focus on the character structure of tokens and

how they are to be processed.

The primary purpose of this section is to show how regular expressions

and related information are presented to scanner generators. A helpful way

to learn Lex is to start with the simple examples presented here and then

gradually generalize them to solve the problem at hand. To inexperienced

readers, Lex's rules may seem unnecessarily complex. It is best to keep in

mind that the key is always the specification of tokens as regular expressions.

The rest is there simply to increase efficiency and handle various details.

Lex's approach to scanning is simple. It allows the user to associate regular

expressions with commands coded in C (or C++). When input characters that

match the regular expression are read, the associated commands are executed.

Users of Lex do not specify how to match tokens, except by providing the

regular expressions. The associated commands specify what should be done

when a particular token is matched.

Lex creates a file lex.yy.c that contains an integer function yylex(). This

function is normally called from the parser whenever another token is needed.

The value that yylex() returns is the token code of the token scanned by

Lex. Tokens such as whitespace are deleted simply by having their associated

command not return anything. Scanning continues until a command with a

return in it is executed.
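
The calling convention described above can be sketched in plain C. This is a hand-written stand-in, not the code Lex generates: the names sketch_yylex, scan_string, and collect_tokens, the TOK_ constants, and their numeric values are all assumptions made for illustration. It shows only the contract: whitespace is consumed without a return, a reserved word returns its token code, and 0 marks end of input.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; a real project would take these from the
   header shared with the parser rather than define them here. */
enum { TOK_EOF = 0, TOK_FLOATDCL = 258, TOK_INTDCL = 259, TOK_PRINT = 260 };

/* Input cursor for the stand-in scanner (an assumption of this sketch;
   a Lex-generated scanner reads from the FILE pointer yyin instead). */
static const char *scan_pos = "";

static void scan_string(const char *text) { scan_pos = text; }

/* Hand-written stand-in for the yylex() that Lex would generate. */
static int sketch_yylex(void) {
    for (;;) {
        unsigned char c = (unsigned char)*scan_pos;
        if (c == '\0') return TOK_EOF;     /* 0 signals end of input */
        scan_pos++;
        if (isspace(c)) continue;          /* no return: whitespace is dropped */
        if (c == 'f') return TOK_FLOATDCL;
        if (c == 'i') return TOK_INTDCL;
        if (c == 'p') return TOK_PRINT;
        /* any other character is silently skipped in this sketch */
    }
}

/* Parser-style driver: keep requesting tokens until end of input.
   Stores at most max codes in out and returns how many were stored. */
static int collect_tokens(const char *text, int *out, int max) {
    int n = 0, code;
    assert(out != NULL);
    scan_string(text);
    while (n < max && (code = sketch_yylex()) != TOK_EOF)
        out[n++] = code;
    return n;
}
```

Calling collect_tokens(" f \n i\tp ", codes, 8) stores three codes, one per reserved word; the blanks, tab, and newline never reach the caller because no return is executed for them.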

Figure 3.7 illustrates a simple Lex definition for the three reserved words

(f, i, and p) of the ac language introduced in Chapter 2. When a string

matching any of these three reserved words is found, the appropriate

token code is returned. It is vital that the token codes that are returned when a

token is matched are identical to those expected by the parser. If they are not,

then the parser will not see the same token sequence produced by the scanner.

This will cause the parser to generate false syntax errors based on the incorrect

token stream it sees.
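
A small C sketch can make the consequence of mismatched codes concrete. Everything here is hypothetical: the token names, the numeric values, and the "stale" numbering are invented for illustration and do not come from any generated file. The parser interprets each incoming integer using its own definitions, so a scanner built against a different numbering delivers a different token than the one it meant to send.

```c
#include <assert.h>
#include <string.h>

/* Token codes as the parser expects them (illustrative values only). */
enum { FLOATDCL = 258, INTDCL = 259, PRINT = 260 };

/* Hypothetical stale numbering that a scanner might have been built with. */
enum { STALE_INTDCL = 258, STALE_FLOATDCL = 259, STALE_PRINT = 260 };

/* Parser side: interpret an incoming code using the parser's definitions. */
static const char *parser_view(int code) {
    switch (code) {
    case FLOATDCL: return "floatdcl";
    case INTDCL:   return "intdcl";
    case PRINT:    return "print";
    default:       return "unknown";
    }
}

/* Self-check documenting the two situations. */
static void check_token_agreement(void) {
    /* Shared definitions: the scanner's code for f reads as floatdcl. */
    assert(strcmp(parser_view(FLOATDCL), "floatdcl") == 0);
    /* Mismatched definitions: the same reserved word reads as intdcl. */
    assert(strcmp(parser_view(STALE_FLOATDCL), "intdcl") == 0);
}
```

In the mismatched case the scanner has matched f correctly, yet the parser receives what it takes to be an integer declaration, which is how a correct input can produce a false syntax error.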

It is standard for the scanner and parser to share the definition of token

codes to guarantee that consistent values are seen by both. The file y.tab.h,
