%%
[a-eghj-oq-z]    { return(ID); }
%%
Figure 3.10: A Lex definition for ac's identifiers.
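As a quick check of which letters the character class in Figure 3.10 admits, the following sketch tests every lowercase letter against it. It uses Python's re module rather than Lex itself, on the assumption that the two agree on how ranges inside brackets are read.

```python
import re
import string

# The character class from Figure 3.10, tested one letter at a time.
identifier = re.compile(r"[a-eghj-oq-z]")

# Collect the lowercase letters that the class does NOT accept.
excluded = [c for c in string.ascii_lowercase if not identifier.fullmatch(c)]
print(excluded)  # ['f', 'i', 'p']
```

The class therefore accepts every lowercase letter except f, i, and p.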
3.5.3 Using Regular Expressions to Define Tokens
Tokens are defined using regular expressions. Lex provides the standard
regular expression operators, as well as others. Catenation is specified by
the juxtaposition of two expressions; no explicit operator is used. Thus
[ab][cd] will match any of ad, ac, bc, or bd. Individual letters and numbers
match themselves when outside of character class brackets. Other characters
should be quoted (to avoid misinterpretation as regular expression
operators). For example, while (as used in C, C++, and Java) can be matched
by the expressions while, "while", or [w][h][i][l][e].
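The catenation and character-class rules above can be checked with a short sketch. It uses Python's re module as a stand-in, assuming it treats juxtaposition and bracketed classes the same way Lex does for these simple patterns. Lex's double-quoted form "while" has no direct Python equivalent, so re.escape plays that role here; that substitution is an assumption of the sketch, not Lex syntax.

```python
import re
from itertools import product

# Catenation is juxtaposition: [ab][cd] means "a or b" followed by "c or d".
pair = re.compile(r"[ab][cd]")
candidates = ["".join(p) for p in product("abcd", repeat=2)]
matched = sorted(s for s in candidates if pair.fullmatch(s))
print(matched)  # ['ac', 'ad', 'bc', 'bd']

# Two spellings from the text, plus an escaped form standing in for
# Lex's quoted "while" (an assumption of this sketch).
spellings = ["while", "[w][h][i][l][e]", re.escape("while")]
results = [bool(re.fullmatch(p, "while")) for p in spellings]
print(results)  # [True, True, True]

# Case is significant: the uppercase spelling is rejected.
upper_ok = bool(re.fullmatch("while", "WHILE"))
print(upper_ok)  # False
```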
Case is significant. The alternation operator is |. As usual, parentheses
can be used to control grouping of subexpressions. Therefore, to match the
reserved word while and allow any mixture of uppercase and lowercase (as
required in Pascal and Ada), we can use (w|W)(h|H)(i|I)(l|L)(e|E).

Postfix operators * (Kleene closure) and + (positive closure) are also
provided, as is ? (optional inclusion). For example, expr? matches expr zero
times or once. It is equivalent to (expr)|λ and obviates the need for an
explicit λ symbol. The character . matches any single character (other than a
newline). The character ^ (when used outside a character class) matches the
beginning of a line. Similarly, the character $ matches the end of a line.
Thus ^A.*e$ could be used to match an entire line that begins with A and ends
with e. We now define all of ac's tokens using Lex's regular expression
facilities. This is shown in Figure 3.11.
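The operators described above can be exercised with a small sketch, again using Python's re module as a stand-in for Lex. The sample strings are invented for illustration. Python's MULTILINE flag is assumed here to approximate the line-oriented meaning Lex gives to ^ and $.

```python
import re

# Alternation with grouping: each letter may be lowercase or uppercase.
any_case_while = re.compile(r"(w|W)(h|H)(i|I)(l|L)(e|E)")
case_hits = [bool(any_case_while.fullmatch(s))
             for s in ("while", "WHILE", "wHiLe", "whale")]
print(case_hits)  # [True, True, True, False]

# Optional inclusion: (ab)? matches ab zero times or once.
optional = re.compile(r"(ab)?c")
optional_hits = [bool(optional.fullmatch(s)) for s in ("c", "abc", "ababc")]
print(optional_hits)  # [True, True, False]

# Kleene closure accepts zero or more repetitions; positive closure needs one.
star_hits = [bool(re.fullmatch(r"a*", s)) for s in ("", "a", "aaa")]
plus_hits = [bool(re.fullmatch(r"a+", s)) for s in ("", "a", "aaa")]
print(star_hits, plus_hits)  # [True, True, True] [False, True, True]

# ^A.*e$ matches whole lines that begin with A and end with e;
# the dot never crosses a newline.
line_pat = re.compile(r"^A.*e$", re.MULTILINE)
text = "Apple pie\nA line that ends in x\nbAe\nAe"
whole_lines = line_pat.findall(text)
print(whole_lines)  # ['Apple pie', 'Ae']
```

The shortest matching line is Ae, since .* may match nothing at all.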
Recall that a Lex specification of a scanner consists of three sections. The
first, not used so far, contains symbolic names associated with character
classes and regular expressions. Symbolic definitions can often make Lex
specifications easier to read, as illustrated in Figure 3.12. There is one
definition per line. Each definition line contains an identifier and a
definition string, separated by a blank or tab. The { and } symbols signal
the macro-expansion of a symbol. For example, the expression {Blank}+ in
Figure 3.12 expands to any positive number of occurrences of Blank, which is
in turn defined as a single space.
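To make the idea of macro-expansion concrete, here is a minimal sketch in Python of how a {Name} reference might be replaced by its definition string. The definitions table and the expand helper are hypothetical names introduced for illustration; they are not part of Lex. Wrapping each expansion in a group is a choice made for this sketch so that a following operator such as + applies to the whole definition.

```python
import re

# Hypothetical definitions table: one name and one definition string each,
# mirroring the "identifier, blank or tab, definition" layout in the text.
definitions = {"Blank": " "}

def expand(pattern, defs):
    """Replace each {Name} reference in pattern with its definition."""
    def substitute(match):
        name = match.group(1)
        # Expand recursively so one definition may refer to another.
        return "(?:" + expand(defs[name], defs) + ")"
    # Names start with a letter or underscore, so {2,3} is left alone.
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, pattern)

blanks = expand("{Blank}+", definitions)
print(blanks)  # (?: )+

one_or_more = bool(re.fullmatch(blanks, "   "))  # three spaces
none_at_all = bool(re.fullmatch(blanks, ""))     # empty string
print(one_or_more, none_at_all)  # True False
```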
The first section can also include source code, delimited by %{ and %}, that
is placed before the commands and regular expressions of section two. This
source code may include statements, as well as variable, procedure, and type
 
 