Database Reference
In-Depth Information
Table 11.1
A sequence
database
Sequence_id
Sequence
1
a
(
abc
)(
ac
)
d
(
cf
)
2
(
ad
)
c
(
bc
)(
ae
)
3
(
ef
)(
ab
)(
df
)
cb
4
eg
(
af
)
cbc
2
Problem Definition
This section presents the formal definition of the sequential pattern mining problem,
and its associated notations.
Let
I
be a set of all
items
, which comprise the alphabet. An
itemset
(or
event
) is a subset of items and denoted by (
i
1
i
2
···
={
i
1
,
i
2
,
...
,
i
n
}
i
m
), where
i
k
is an
item. It is assumed that items in an itemset are sorted in lexicographic order. A
sequence
is an ordered list of itemsets. A sequence
s
is denoted by
,
where
s
j
is an itemset. For brevity, the brackets are omitted if an itemset has only
one item, i.e., itemset (
i
) is written as
i
. An item can occur at most once in an itemset
of a sequence, but can occur multiple times in different itemsets of a sequence. The
number of instances of items in a sequence is called the
length
of the sequence. A
sequence with length
l
is called an
l
-sequence. A sequence
α
s
1
s
2
···
s
l
=
a
1
a
2
···
a
n
is called
a subsequence of another sequence
β
=
b
1
b
2
···
b
m
and
β
is a super-sequence of
α
, denoted by
α
β
, if there exist integers 1
≤
j
1
<j
2
<
···
<j
n
≤
m
such that
a
1
⊆
b
j
n
.
A
sequence database D
is a set of tuples
b
j
1
,
a
2
⊆
b
j
2
,
...
,
a
n
⊆
sid
,
s
, where
sid
is a sequence_id and
s
is a sequence. A tuple
is said to contain a sequence
α
,if
α
is a subsequence of
s
. The
support
of a sequence
α
in a sequence database
D
, denoted by
suppor t
D
(
α
), is
the number of tuples in the database containing
α
. It can be denoted by
suppor t
(
α
)if
the sequence database is clear from the context. Given a positive integer
min_support
as the
minimum support threshold
, a sequence
α
is said to be
frequent
and called
a
sequential pattern
in sequence database
D
if
suppor t
D
(
α
)
sid
,
s
min
_
suppor t
.A
sequential pattern with length
l
is called an
l
-pattern. The set of frequent
l
-sequences
is denoted by
≥
F
l
. If there exists no proper super-sequence of a sequential pattern
α
with the same support as
α
,
α
is called a
closed sequential pattern
(or a
frequent
closed subsequence
) in sequence database
D
. Furthermore, a sequential pattern
α
is
called a
maximal sequential pattern
(or a
frequent maximal subsequence
)ifitisnot
contained in any other sequential pattern.
Formally, given a sequence database
D
and the minimum support threshold
min_support
, the problems of sequential pattern mining, closed sequential pattern
mining, and maximal sequential pattern mining, are to find the set of all frequent
subsequences, all frequent closed subsequences, and all frequent maximal subse-
quences from the input sequence database
D
, respectively. To estimate the upper
bound on the number of sequential patterns given a sequence database, Raïssi and
Pei proposed two novel techniques in [
18
].
A sequence database
D
is given in Table
11.1
and
min_support
=2. The set of
items in the database
D
is
{
a
,
b
,
c
,
d
,
e
,
f
,
g
}
. A sequence
a
(
abc
)(
ac
)
d
(
cf
)
has