Database Reference
In-Depth Information
Table 3.5 A sequence
database
Sequence_id
Sequence
10
a ( abc )( ac ) d ( cf )
20
( ad ) c ( bc )( ae )
30
( ef )( ab )( df ) cb
40
eg ( af ) cbc
the database, i.e., suppor t S ( α )
ξ . A sequential pattern with length l is called an
l-pattern .
Given a sequence database and a min_support threshold, the problem of sequential
pattern mining is to find the complete set of sequential patterns in the database.
Sequential pattern mining is more challenging than mining frequent itemsets. Se-
quences allow multiple occurrences of items and combination of items into itemsets,
which may lead to a combination explosion. For example, using items a and b , there
are only three possible itemsets: a , b and ab . However, even the length of sequences
is limited to 3, there are 12 possible sequences:
aaa
,
aab
, ... ,
bbb
,
( ab ) a
,
... ,
.
Sequential patterns also have the Apriori property : every non-empty sub-sequence
of a sequential pattern is a sequential pattern . A typical sequential pattern mining
algorithm, GSP [ 35 ], is based on this Apriori property to reduce search space. How-
ever, the method bears similar non-trivial, inherent costs as to Apriori in mining
frequent itemsets.
Following the similar philosophy of frequent pattern growth, two algorithms,
FreeSpan [ 17 ] and PrefixSpan [ 30 ], are developed for pattern growth-based sequen-
tial pattern mining. FreeSpan mines sequential patterns by projecting the sequence
database based on any frequent subsequences and growing subsequences in any
position; whereas PrefixSpan does it by projecting the database based on only the
frequent prefix subsequences and adding postfixes in the growth. Both methods find
the complete set of sequential patterns but the latter is more efficient since it involves
less database projections and less subsequence combinations to be examined. This
analysis has also been verified by the performance results, and thus we examine only
PrefixSpan using an example.
b ( ab )
Example 5 (PrefixSpan) Suppose we want to mine sequential patterns in a sequence
database S , shown in Table 3.5 , with the support threshold set to 2. PrefixSpan works
as follows.
First, we find length
1 sequential patterns by scanning S once. This derives
the set of frequent items in sequences, i.e., the set of length-1 sequential patterns:
{(
: 3)}.
Then, the search space can be partitioned into the following six subsets :
(1) the ones with prefix
a
: 4), (
b
: 4), (
c
: 4), (
d
: 3), (
e
: 3), and (
f
. The subsets of
sequential patterns can be mined by constructing corresponding projected databases
and mine each recursively. The projected databases as well as sequential patterns
found in them are listed in Table 3.6 , and the mining process is explained as follows.
a
; ... ; and (6) the ones with prefix
f
 
Search WWH ::




Custom Search