Lexical Analysis - Introduction to Compiler Construction in a Java World

Java Reference

In-Depth Information

Notice that r 0 = , the empty string of length 0; r 1 = r, r 2 = rr, r 3 = rrr, and so on;

r denotes an innite number of nite strings.

5. is a regular expression describing the language containing only the empty string.

6. Finally, if r is a regular expression, then (r) is also a regular expression denoting the

same language. The parentheses serve only for grouping.

Example. So, for example, given an alphabet f0; 1g,

1. 0 is a regular expression describing the single string 0

2. 1 is a regular expression describing the single string 1

3. 0j1 is a regular expression describing the language of two strings 0 and 1

4. (0j1) is a regular expression describing the (same) language of two strings 0 and 1

5. (0j1) is a regular expression describing the language of all strings, including the empty

string, of 1's and 0's: , 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, ..., 000111, ...

6. 1(0j1) is a regular expression describing the language of all strings of 1's and 0's that

start with a 1.

7. 0j1(0j1) is a regular expression describing the language consisting of all binary num-

bers (excluding those having unnecessary leading zeros).

Notice that there is an order of precedence in the construction of regular expressions:

repetition has the highest precedence, then concatenation, and finally alternation. So, 01

0j1 is equivalent to (0(1)0)j(1). Of course, parentheses may always be used to change the

grouping of sub-expressions.

Example. Given an alphabet fa;bg,

1. a(ajb) denotes the language of non-empty strings of a's and b's, beginning with an a

2. aajabjbajbb denotes the language of all two-symbol strings over the alphabet

3. (ajb)ab denotes the language of all strings of a's and b's, ending in ab (this includes

the string ab itself)

As in programming, we often find it useful to give names to things. For example, we can

dene D=1|2|3|4|5|6|7|8|9 . Then we can say 0|D(D|0)* denotes the language of natural

numbers, which is the same as 0|(1|2|3|4|5|6|7|8|9)(1|2|3|4|5|6|7|8|9|0)* .

There are all sorts of extensions to the notation of regular expressions, all of which are

shorthand for standard regular expressions. For example, in the Java Language Specifica-

tion [Gosling et al., 2005], [0-9] is shorthand for (0|1|2|3|4|5|6|7|8|9) , and [a-z] is

shorthand for (a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z) .

Other notations abound. For example, there are the POSIX extensions [IEEE, 2004],

which allow the square bracket notation above, ? for optional, +, , etc. JavaCC uses its

own notation. In our appendices, we use the notation used by [Gosling et al., 2005]. The

important thing is that all of these extensions are simply shorthand for regular expressions

that may be written using the notation described in Definition 2.1.

In describing the lexical tokens of programming languages, one uses some standard

input character set as one's alphabet, for example, the 128 character ASCII set, the 256

character extended ASCII set, or the much larger Unicode set. Java works with Unicode,

but aside from identifiers, characters, and string literals, all input characters are ASCII,

making implementations compatible with legacy operating systems. We do the same for j--.

Search WWH ::

Custom Search

Home