Java Reference
In-Depth Information
Notice that r 0 = , the empty string of length 0; r 1 = r, r 2 = rr, r 3 = rrr, and so on;
r denotes an innite number of nite strings.
5. is a regular expression describing the language containing only the empty string.
6. Finally, if r is a regular expression, then (r) is also a regular expression denoting the
same language. The parentheses serve only for grouping.
Example. So, for example, given an alphabet f0; 1g,
1. 0 is a regular expression describing the single string 0
2. 1 is a regular expression describing the single string 1
3. 0j1 is a regular expression describing the language of two strings 0 and 1
4. (0j1) is a regular expression describing the (same) language of two strings 0 and 1
5. (0j1) is a regular expression describing the language of all strings, including the empty
string, of 1's and 0's: , 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, ..., 000111, ...
6. 1(0j1) is a regular expression describing the language of all strings of 1's and 0's that
start with a 1.
7. 0j1(0j1) is a regular expression describing the language consisting of all binary num-
bers (excluding those having unnecessary leading zeros).
Notice that there is an order of precedence in the construction of regular expressions:
repetition has the highest precedence, then concatenation, and finally alternation. So, 01
0j1 is equivalent to (0(1)0)j(1). Of course, parentheses may always be used to change the
grouping of sub-expressions.
Example. Given an alphabet fa;bg,
1. a(ajb) denotes the language of non-empty strings of a's and b's, beginning with an a
2. aajabjbajbb denotes the language of all two-symbol strings over the alphabet
3. (ajb)ab denotes the language of all strings of a's and b's, ending in ab (this includes
the string ab itself)
As in programming, we often find it useful to give names to things. For example, we can
dene D=1|2|3|4|5|6|7|8|9 . Then we can say 0|D(D|0)* denotes the language of natural
numbers, which is the same as 0|(1|2|3|4|5|6|7|8|9)(1|2|3|4|5|6|7|8|9|0)* .
There are all sorts of extensions to the notation of regular expressions, all of which are
shorthand for standard regular expressions. For example, in the Java Language Specifica-
tion [Gosling et al., 2005], [0-9] is shorthand for (0|1|2|3|4|5|6|7|8|9) , and [a-z] is
shorthand for (a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z) .
Other notations abound. For example, there are the POSIX extensions [IEEE, 2004],
which allow the square bracket notation above, ? for optional, +, , etc. JavaCC uses its
own notation. In our appendices, we use the notation used by [Gosling et al., 2005]. The
important thing is that all of these extensions are simply shorthand for regular expressions
that may be written using the notation described in Definition 2.1.
In describing the lexical tokens of programming languages, one uses some standard
input character set as one's alphabet, for example, the 128 character ASCII set, the 256
character extended ASCII set, or the much larger Unicode set. Java works with Unicode,
but aside from identifiers, characters, and string literals, all input characters are ASCII,
making implementations compatible with legacy operating systems. We do the same for j--.
 
Search WWH ::




Custom Search