Java Reference
In-Depth Information
via catenation (that is,
by joining individual characters to form a string). As characters are catenated
to a string, it grows in length. For example, the string do is built by first
catenating d to
Strings are built fromcharacters in the character set
Σ
and then catenating o to the string d. The null string, when
catenated with any string s , yields s .Thatis, s λ≡λ s s . Catenating
λ
λ
to a
string is like adding 0 to an integer—nothing changes.
Catenation is extended to sets of strings as follows. Let P and Q be sets of
strings. The symbol
represents set membership. If s 1
P and s 2
Q ,then
string s 1 s 2
( PQ ). Small finite sets are conveniently represented by listing
their elements, which can be individual characters or strings of characters.
Parentheses are used to delimit expressions, and
, the alternation operator, is
used to separate alternatives. For example, D, the set of the ten single digits,
is defined as D =
|
(0
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
|
9).
In this text, we often
use abbreviations such as (0
|...|
9) rather than enumerate a complete list of
alternatives. The
symbol is not part of our regular expression notation.
A meta-character is any punctuation character or regular expression oper-
ator. A meta-character must be quoted when used as an ordinary character in
order to avoid ambiguity. (Any character or string may be quoted, but unnec-
essary quotation is avoided to enhance readability.) The following six symbols
are meta-characters: ( ) ' *
...
, ) defines four single
character tokens (left parenthesis, right parenthesis, semicolon, and comma)
that we might use in a programming language. The parentheses are quoted
to show they are meant to be individual tokens and not delimiters in a larger
regular expression.
Alternation can be extended to sets of strings. Let P and Q be sets of
strings. Then string s
+|
. The expression ( '('
|
')'
|
;
|
( P | Q ) if, and only if, s P or s Q . For example, if
LC is the set of lowercase letters and UC is the set of uppercase letters, then
( LC | UC ) denotes the set of all letters (in either case).
Large (or infinite) sets are conveniently represented by operations on finite
sets of characters and strings. Catenation and alternationmay be used. A third
operation, Kleene closure , as defined below, is also allowed. The operator
is the postfix Kleene closure operator . For example, let P be a set of strings. Then
P represents all strings formed by the catenation of zero or more selections
(possibly repeated) from P . (Zero selections are represented by
.) For exam-
ple, LC is the set of all words composed only of lowercase letters and of any
length (including the zero-length word,
λ
).
Precisely stated, a string s P if, and only if, s can be broken into zero or
more pieces:
λ
s = s 1 s 2
... s n such that each s i P ( n
0
,
1
i n ). We explicitly
is always in P .
Now that we have introduced the operators used in regular expressions,
we can define regular expressions as follows:
•∅
allow n =
0sothat
λ
is a regular expression denoting the empty set (the set containing no
strings).
is rarely used but is included for completeness.
 
Search WWH ::




Custom Search