Sequential Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

α 1 is frequent since the first δ customer sequences in the k-sorted database take α 1 as

their k-minimum subsequence, and if α 1 is smaller than α δ , then α 1 is non-frequent

and all subsequences up to and including α δ can be skipped. This process is then re-

peated for the next k-minimum subsequence in the resorted k-sorted database. Thus,

the DISC strategy uses a k-sorted database to find all the frequent k-sequences and

skips most non-frequent k-sequences by checking only the conditional k-minimum

subsequences. It uses a partitioning method similar to PrefixSpan for generating fre-

quent 2-sequences and 3-sequences, and then employs the DISC strategy to generate

the remaining frequent sequences.

5.5

Approximate Methods

Conventional sequential pattern mining methods may meet inherent difficulties in

mining databases with long sequences and noise. They may generate a huge number

of short and trivial patterns but fail to find interesting patterns approximately shared

by many sequences. In [ 37 ], Kum et al. proposed the theme of approximate sequential

pattern mining . The general idea is that, instead of finding exact patterns, they identify

patterns approximately shared by many sequences. Instead of mining a huge set of

patterns, they propose to mine consensus patterns from databases of long sequences.

Intuitively, a consensus pattern is shared by many sequences and covers many short

patterns. To mine consensus sequential patterns from large databases, Kum et al.

developed an efficient algorithm ApproxMAP, a cluster and multiple alignment based

approach, which works in two steps. First, sequences in a database are clustered

based on similarity. Sequences in the same cluster may approximately follow some

similar patterns. To enable the clustering of sequences, a modified version of the

hierarchical edit distance metric is used in a density-based clustering algorithm.

Then, the longest approximate sequential pattern for each cluster is generated. It

is called the consensus pattern. To extract consensus patterns, a weighted sequence

which records the statistics of the alignment of the sequences is derived for each

cluster using multiple alignment to compress the sequential pattern information in

the cluster. And then the longest consensus pattern best representing the cluster is

generated from the weighted sequence.

5.6

Top-k Closed Sequential Pattern Mining

Mining closed sequential patterns may significantly reduce the number of patterns

generated and is information lossless because it can be used to derive the complete

set of sequential patterns. However, setting min_support is a subtle task: a too small

value may lead to the generation of thousands of patterns, whereas a too big one may

lead to no answer found. To come up with an appropriate min_support , one needs

prior knowledge about the mining query and the task-specific data, and be able to

Search WWH ::

Custom Search

Home