Database Reference
In-Depth Information
Table 11.1 A sequence
database
Sequence_id
Sequence
1
a ( abc )( ac ) d ( cf )
2
( ad ) c ( bc )( ae )
3
( ef )( ab )( df ) cb
4
eg ( af ) cbc
2
Problem Definition
This section presents the formal definition of the sequential pattern mining problem,
and its associated notations.
Let I
be a set of all items , which comprise the alphabet. An
itemset (or event ) is a subset of items and denoted by ( i 1 i 2 ···
={
i 1 , i 2 , ... , i n }
i m ), where i k is an
item. It is assumed that items in an itemset are sorted in lexicographic order. A
sequence is an ordered list of itemsets. A sequence s is denoted by
,
where s j is an itemset. For brevity, the brackets are omitted if an itemset has only
one item, i.e., itemset ( i ) is written as i . An item can occur at most once in an itemset
of a sequence, but can occur multiple times in different itemsets of a sequence. The
number of instances of items in a sequence is called the length of the sequence. A
sequence with length l is called an l -sequence. A sequence α
s 1 s 2 ···
s l
=
a 1 a 2 ···
a n
is called
a subsequence of another sequence β
=
b 1 b 2 ···
b m
and β is a super-sequence of
α , denoted by α
β , if there exist integers 1
j 1 <j 2 <
···
<j n
m such that
a 1
b j n .
A sequence database D is a set of tuples
b j 1 , a 2
b j 2 , ... , a n
sid , s
, where sid is a sequence_id and s
is a sequence. A tuple
is said to contain a sequence α ,if α is a subsequence of
s . The support of a sequence α in a sequence database D , denoted by suppor t D ( α ), is
the number of tuples in the database containing α . It can be denoted by suppor t ( α )if
the sequence database is clear from the context. Given a positive integer min_support
as the minimum support threshold , a sequence α is said to be frequent and called
a sequential pattern in sequence database D if suppor t D ( α )
sid , s
min _ suppor t .A
sequential pattern with length l is called an l -pattern. The set of frequent l -sequences
is denoted by
F l . If there exists no proper super-sequence of a sequential pattern α
with the same support as α , α is called a closed sequential pattern (or a frequent
closed subsequence ) in sequence database D . Furthermore, a sequential pattern α is
called a maximal sequential pattern (or a frequent maximal subsequence )ifitisnot
contained in any other sequential pattern.
Formally, given a sequence database D and the minimum support threshold
min_support , the problems of sequential pattern mining, closed sequential pattern
mining, and maximal sequential pattern mining, are to find the set of all frequent
subsequences, all frequent closed subsequences, and all frequent maximal subse-
quences from the input sequence database D , respectively. To estimate the upper
bound on the number of sequential patterns given a sequence database, Raïssi and
Pei proposed two novel techniques in [ 18 ].
A sequence database D is given in Table 11.1 and min_support =2. The set of
items in the database D is
{
a , b , c , d , e , f , g
}
. A sequence
a ( abc )( ac ) d ( cf )
has
 
Search WWH ::




Custom Search