Java Reference
In-Depth Information
via
catenation
(that is,
by joining individual characters to form a string). As characters are catenated
to a string, it grows in length. For example, the string do is built by first
catenating d to
Strings are built fromcharacters in the character set
Σ
and then catenating o to the string d. The null string, when
catenated with any string
s
, yields
s
.Thatis,
s
λ≡λ
s
≡
s
. Catenating
λ
λ
to a
string is like adding 0 to an integer—nothing changes.
Catenation is extended to sets of strings as follows. Let
P
and
Q
be sets of
strings. The symbol
∈
represents set membership. If
s
1
∈
P
and
s
2
∈
Q
,then
string
s
1
s
2
(
PQ
). Small finite sets are conveniently represented by listing
their elements, which can be individual characters or strings of characters.
Parentheses are used to delimit expressions, and
∈
, the alternation operator, is
used to separate alternatives. For example, D, the set of the ten single digits,
is defined as
D
=
|
(0
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
|
9).
In this text, we often
use abbreviations such as (0
|...|
9) rather than enumerate a complete list of
alternatives. The
symbol is
not
part of our regular expression notation.
A
meta-character
is any punctuation character or regular expression oper-
ator. A meta-character must be quoted when used as an ordinary character in
order to avoid ambiguity. (Any character or string may be quoted, but unnec-
essary quotation is avoided to enhance readability.) The following six symbols
are meta-characters: ( ) ' *
...
, ) defines four single
character tokens (left parenthesis, right parenthesis, semicolon, and comma)
that we might use in a programming language. The parentheses are quoted
to show they are meant to be individual tokens and not delimiters in a larger
regular expression.
Alternation can be extended to sets of strings. Let
P
and
Q
be sets of
strings. Then string
s
∈
+|
. The expression ( '('
|
')'
|
;
|
(
P
|
Q
) if, and only if,
s
∈
P
or
s
∈
Q
. For example, if
LC
is the set of lowercase letters and
UC
is the set of uppercase letters, then
(
LC
|
UC
) denotes the set of all letters (in either case).
Large (or infinite) sets are conveniently represented by operations on finite
sets of characters and strings. Catenation and alternationmay be used. A third
operation,
Kleene closure
, as defined below, is also allowed. The operator
is the postfix
Kleene closure operator
. For example, let
P
be a set of strings. Then
P
represents all strings formed by the catenation of zero or more selections
(possibly repeated) from
P
. (Zero selections are represented by
.) For exam-
ple,
LC
is the set of all words composed only of lowercase letters and of any
length (including the zero-length word,
λ
).
Precisely stated, a string
s
∈
P
if, and only if,
s
can be broken into zero or
more pieces:
λ
s
=
s
1
s
2
...
s
n
such that each
s
i
∈
P
(
n
≥
0
,
1
≤
i
≤
n
). We explicitly
is always in
P
.
Now that we have introduced the operators used in regular expressions,
we can define regular expressions as follows:
•∅
allow
n
=
0sothat
λ
is a regular expression denoting the empty set (the set containing no
strings).
∅
is rarely used but is included for completeness.