Big Data Frequent Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

support counts can then be computed via id-list intersections. Zaki used lattice theory

[ 12 ] to show that any possible frequent sequence can be obtained as a union or join

of shorter frequent sequences. He proposed both breath-first-search and depth-first-

search approaches to recursively decompose the lattice into prefix-based classes,

providing an alternative to the breath-first-search approach of GSP.

Han et al. [ 20 ] introduced Frequent pattern-projected Sequential pattern mining

(FreeSpan), which Pei et al. [ 43 ] followed with Prefix-projected Sequential pattern

mining (PrefixSpan). FreeSpan and PrefixSpan recursively partition the sequence

data based on frequent items and sequence prefixes, respectively. They first scan

the database and derive frequent items. Then, projected databases are constructed

for each of the frequent items, or length-1 frequent sequences, eliminating those

sequences that only contain infrequent items. By choosing to partition on sequence

prefixes, PrefixSpan is able to find all frequent sequences by examining only prefix

sub-sequences and projecting only their corresponding postfix sub-sequences into

projected databases. Pei et al. also explore pseudo-projections (when data fits in

main memory) and bi-level projections in PrefixSpan, using the downward closure

property to prune items in projected databases.

Guralnik and Karypis [ 18 ] altered a tree projection algorithm for discovering

frequent itemsets by Agarwal et al. [ 4 ] to mine frequent sequences. As in the FIM

approach, a lexicographic projection tree is grown through bi-level candidate se-

quence generation (nodes at level k

1). Each

node in the tree represents a frequent sequence, which can be extended either by

adding a lexicographically correct item in the last element of the sequence ( itemset

extension ) or a new itemset/element to the sequence ( sequence extension ). Each ac-

tive node, one that still has a possibility of being extended, maintains four sparse

count matrices used to efficiently update frequencies during the projection phase.

In the context of Web access mining, Pei et al. [ 42 ] developed Web Access Pattern

miner (WAP-miner) for mining symbol sequences constructed from Web logs. The

algorithm uses a Web access pattern tree ( WAP-tree ) as a compressed representation

of the sequence database and a conditional search tree-projection mechanism for

growing sequence patterns. Rajimol and Raju [ 46 ] surveyed several WAP-tree based

extensions which improve the data structure and projection mechanism of WAP-

miner.

While finding all frequent patterns is costly, the majority of the resulting pat-

terns may not be interesting. CloSpan (Yan et al. [ 61 ]) and BIDE (Wang and Han

[ 56 ]) focus on finding frequent closed sequences . Though following a candidate

generate-and-test methodology, CloSpan prunes more of the search space through

CommonPrefix and Backward Sub-Pattern pruning . It also speeds up testing by not-

ing that projected databases with the same number of items are equivalent. BIDE

uses a paradigm called BI-Directional Extension to both check closure of candidate

sequences and prune the search space.

As a related problem first studied by Mannila et al. [ 36 ], frequent episode mining

finds frequent collections of events that occur relatively close to each other in a long

symbol sequence, given a partial event order. Joshi et al. [ 26 ] developed a universal

formulation for sequence patterns, encompassing both Mannila et al.'s definition of

−

1 generate candidates for level k

+

Frequent Pattern Mining

Search WWH ::

Custom Search

Home