Database Reference
In-Depth Information
α 1 is frequent since the first δ customer sequences in the k-sorted database take α 1 as
their k-minimum subsequence, and if α 1 is smaller than α δ , then α 1 is non-frequent
and all subsequences up to and including α δ can be skipped. This process is then re-
peated for the next k-minimum subsequence in the resorted k-sorted database. Thus,
the DISC strategy uses a k-sorted database to find all the frequent k-sequences and
skips most non-frequent k-sequences by checking only the conditional k-minimum
subsequences. It uses a partitioning method similar to PrefixSpan for generating fre-
quent 2-sequences and 3-sequences, and then employs the DISC strategy to generate
the remaining frequent sequences.
5.5
Approximate Methods
Conventional sequential pattern mining methods may meet inherent difficulties in
mining databases with long sequences and noise. They may generate a huge number
of short and trivial patterns but fail to find interesting patterns approximately shared
by many sequences. In [ 37 ], Kum et al. proposed the theme of approximate sequential
pattern mining . The general idea is that, instead of finding exact patterns, they identify
patterns approximately shared by many sequences. Instead of mining a huge set of
patterns, they propose to mine consensus patterns from databases of long sequences.
Intuitively, a consensus pattern is shared by many sequences and covers many short
patterns. To mine consensus sequential patterns from large databases, Kum et al.
developed an efficient algorithm ApproxMAP, a cluster and multiple alignment based
approach, which works in two steps. First, sequences in a database are clustered
based on similarity. Sequences in the same cluster may approximately follow some
similar patterns. To enable the clustering of sequences, a modified version of the
hierarchical edit distance metric is used in a density-based clustering algorithm.
Then, the longest approximate sequential pattern for each cluster is generated. It
is called the consensus pattern. To extract consensus patterns, a weighted sequence
which records the statistics of the alignment of the sequences is derived for each
cluster using multiple alignment to compress the sequential pattern information in
the cluster. And then the longest consensus pattern best representing the cluster is
generated from the weighted sequence.
5.6
Top-k Closed Sequential Pattern Mining
Mining closed sequential patterns may significantly reduce the number of patterns
generated and is information lossless because it can be used to derive the complete
set of sequential patterns. However, setting min_support is a subtle task: a too small
value may lead to the generation of thousands of patterns, whereas a too big one may
lead to no answer found. To come up with an appropriate min_support , one needs
prior knowledge about the mining query and the task-specific data, and be able to
Search WWH ::




Custom Search