Pattern-Growth Methods - Frequent Pattern Mining

Database Reference

In-Depth Information

Table 3.5 A sequence

database

Sequence_id

Sequence

a ( abc )( ac ) d ( cf )

( ad ) c ( bc )( ae )

( ef )( ab )( df ) cb

eg ( af ) cbc

the database, i.e., suppor t S ( α )

≥

ξ . A sequential pattern with length l is called an

l-pattern .

Given a sequence database and a min_support threshold, the problem of sequential

pattern mining is to find the complete set of sequential patterns in the database.

Sequential pattern mining is more challenging than mining frequent itemsets. Se-

quences allow multiple occurrences of items and combination of items into itemsets,

which may lead to a combination explosion. For example, using items a and b , there

are only three possible itemsets: a , b and ab . However, even the length of sequences

is limited to 3, there are 12 possible sequences:

aaa

aab

, ... ,

bbb

( ab ) a

... ,

Sequential patterns also have the Apriori property : every non-empty sub-sequence

of a sequential pattern is a sequential pattern . A typical sequential pattern mining

algorithm, GSP [ 35 ], is based on this Apriori property to reduce search space. How-

ever, the method bears similar non-trivial, inherent costs as to Apriori in mining

frequent itemsets.

Following the similar philosophy of frequent pattern growth, two algorithms,

FreeSpan [ 17 ] and PrefixSpan [ 30 ], are developed for pattern growth-based sequen-

tial pattern mining. FreeSpan mines sequential patterns by projecting the sequence

database based on any frequent subsequences and growing subsequences in any

position; whereas PrefixSpan does it by projecting the database based on only the

frequent prefix subsequences and adding postfixes in the growth. Both methods find

the complete set of sequential patterns but the latter is more efficient since it involves

less database projections and less subsequence combinations to be examined. This

analysis has also been verified by the performance results, and thus we examine only

PrefixSpan using an example.

b ( ab )

Example 5 (PrefixSpan) Suppose we want to mine sequential patterns in a sequence

database S , shown in Table 3.5 , with the support threshold set to 2. PrefixSpan works

as follows.

First, we find length

1 sequential patterns by scanning S once. This derives

the set of frequent items in sequences, i.e., the set of length-1 sequential patterns:

{(

−

: 3)}.

Then, the search space can be partitioned into the following six subsets :

(1) the ones with prefix

: 4), (

: 3), (

: 3), and (

. The subsets of

sequential patterns can be mined by constructing corresponding projected databases

and mine each recursively. The projected databases as well as sequential patterns

found in them are listed in Table 3.6 , and the mining process is explained as follows.

; ... ; and (6) the ones with prefix

Frequent Pattern Mining

Search WWH ::

Custom Search

Home