Scanning—Theory and Practice - Crafting a Compiler


if it is not already there. Whether the identifier is entered or is already in the

table, a pointer to the symbol table entry is then returned from the scanner.

In block-structured languages, the scanner generally is not expected to

enter or look up identifiers in the symbol table because an identifier can be

used in many contexts (for example, as a variable, member of a class, or label).

The scanner usually cannot know when an identifier should be entered into

the symbol table for the current scope or when it should return a pointer to an

instance from an earlier scope. Some scanners just copy the identifier into a

private string variable (that cannot be overwritten) and return a pointer to it. A

later compiler phase, the type checker, then resolves the identifier's intended

usage.

Sometimes a string space is used to store identifiers in conjunction with a

symbol table (see Chapter 8). A string space is an extendable block of memory

used to store the text of identifiers. A string space eliminates frequent calls to

memory allocators such as new or malloc. It also avoids the space overhead of

storing multiple copies of the same string. The scanner can enter an identifier

into the string space and return a pointer into the string space rather than the

actual text.
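The string space idea above can be sketched as a single growable character buffer. This is a minimal illustration under stated assumptions, not the book's implementation: the class name StringSpace, the methods enter and textAt, and the initial capacity are all hypothetical. The sketch only appends text; detecting that an identifier is already present is assumed to be handled elsewhere (for example, by the symbol table).

```java
import java.util.Arrays;

// Hypothetical sketch of a string space: one growable char buffer that holds
// the text of every identifier, addressed by offset rather than by String.
final class StringSpace {
    private char[] buffer = new char[16]; // initial capacity chosen arbitrarily
    private int used = 0;                 // number of chars currently stored

    // Appends the identifier's text and returns the offset where it starts.
    int enter(CharSequence text) {
        int needed = used + text.length();
        if (needed > buffer.length) {
            // Grow geometrically so that reallocations stay infrequent.
            buffer = Arrays.copyOf(buffer, Math.max(needed, buffer.length * 2));
        }
        int start = used;
        for (int i = 0; i < text.length(); i++) {
            buffer[used++] = text.charAt(i);
        }
        return start;
    }

    // Recovers the text stored at the given offset with the given length.
    String textAt(int start, int length) {
        return new String(buffer, start, length);
    }
}
```

A scanner built this way would return the offset (and length) in the identifier token, so that one large allocation serves many identifiers.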

An alternative to a string space is a hash table that stores identifiers and

assigns to each a unique serial number. A serial number is a small integer that

can be used instead of a string space pointer. All identifiers that have the same

text get the same serial number; identifiers with different texts get different

serial numbers. Serial numbers are ideal indices into symbol tables (which

need not be hashed) because they are small, contiguously assigned integers. A

scanner can hash an identifier when it is scanned and return its serial number

as part of the identifier token.
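As a rough illustration of serial-number assignment, the sketch below uses a hash map from identifier text to a contiguously assigned integer, plus a list for the reverse mapping. The names SerialNumberTable, serialFor, and textOf are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assigns each distinct identifier text a small,
// contiguously numbered serial (0, 1, 2, ...).
final class SerialNumberTable {
    private final Map<String, Integer> serialByText = new HashMap<>();
    private final List<String> textBySerial = new ArrayList<>();

    // Returns the existing serial number for this text, or assigns the next one.
    int serialFor(String text) {
        Integer existing = serialByText.get(text);
        if (existing != null) {
            return existing;
        }
        int serial = textBySerial.size(); // next unused index
        serialByText.put(text, serial);
        textBySerial.add(text);
        return serial;
    }

    // Maps a serial number back to the identifier's text.
    String textOf(int serial) {
        return textBySerial.get(serial);
    }
}
```

Because the serial numbers are dense, a later phase could index a plain array or list with them instead of hashing the identifier again.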

In some languages, such as C, C++, and Java, case is significant; in others,

such as Ada and Pascal, it is not. When case is significant, identifier text

must be stored or returned exactly as it was scanned. Reserved word lookup

must distinguish between identifiers and reserved words that differ only in

case. However, when case is insignificant, case differences in the spelling of an

identifier or reserved word must be guaranteed to not cause errors. This can

be done by putting all tokens scanned as identifiers into a uniform case before

they are returned or looked up in a reserved word table.
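For a case-insensitive language, the normalization step described above might look like the following sketch. The reserved-word set and the names CaseInsensitiveLookup, normalize, and isReserved are hypothetical; lower case is chosen arbitrarily as the uniform case.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: fold every identifier-like token to one case before
// it is returned or looked up in the reserved word table.
final class CaseInsensitiveLookup {
    // Illustrative reserved words, stored in lower case.
    private static final Set<String> RESERVED =
            Set.of("begin", "end", "if", "then", "while");

    // Locale.ROOT keeps the folding independent of the user's locale.
    static String normalize(String scanned) {
        return scanned.toLowerCase(Locale.ROOT);
    }

    static boolean isReserved(String scanned) {
        return RESERVED.contains(normalize(scanned));
    }
}
```

With this arrangement, BEGIN, Begin, and begin all reach the reserved word table as the same text.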

Other tokens, such as literals, require processing before they are returned.

Integer and real (floating) literals are converted to numeric form and returned

as part of the token. Numeric conversion can be tricky because of the danger

of overflow or roundoff errors. It is wise to use standard library routines such

as atoi and atof (in C) or Integer.intValue and Float.floatValue (in Java).

For string literals, a pointer to the text of the string (with escaped characters

expanded) should be returned.
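The literal processing described above can be sketched as follows. This example is an assumption-laden illustration: it uses the standard library methods Integer.parseInt and Float.parseFloat, which accept text directly, and the class LiteralConverter, its method names, and the small set of supported escapes are hypothetical.

```java
// Hypothetical sketch of literal conversion in a scanner.
final class LiteralConverter {
    // Integer.parseInt throws NumberFormatException if the digits
    // do not fit in an int, so overflow is reported rather than wrapped.
    static int toInt(String digits) {
        return Integer.parseInt(digits);
    }

    // Float.parseFloat yields infinity for out-of-range magnitudes;
    // this sketch treats that as an error. Underflow to zero is not detected.
    static float toFloat(String text) {
        float value = Float.parseFloat(text);
        if (Float.isInfinite(value)) {
            throw new NumberFormatException("float literal out of range: " + text);
        }
        return value;
    }

    // Expands a few common escapes in the body of a string literal
    // (the text between the quotes).
    static String expandEscapes(String body) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char next = body.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    case '"': out.append('"'); break;
                    case '\\': out.append('\\'); break;
                    default: out.append('\\').append(next); // unknown escape kept as written
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Relying on the library routines means range checking does not have to be hand-coded digit by digit, which is where overflow bugs tend to appear.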

The design of C contains a flaw that requires a C scanner to do a bit of

special processing. The character sequence a(*b); can be a call to procedure
