Pattern-Growth Methods - Frequent Pattern Mining

Database Reference

In-Depth Information

Second, the pattern growth approach has been extended to pattern-based classifica-

tion. Frequent patterns have been demonstrated to be useful for classification, where

association rules are generated and analyzed for effective classification [ 21 , 22 ]by

discovery of strong associations between frequent patterns and class labels. More-

over, a further study [ 10 ] has provided solid reasoning to support the methodology

of frequent pattern-based classification. By building a connection between pattern

frequency and discriminative measures, such as information gain and Fisher score, it

is shown that discriminative frequent patterns are essential for classification, whereas

inclusion of infrequent patterns may not improve the classification accuracy due to

their limited predictive power. A pattern-growth based methodology, called DDP-

Mine [ 11 ], is developed that performs a branch-and-bound search for directly mining

discriminative patterns without generating the complete pattern set. Instead of select-

ing best patterns in a batch, a “feature-centered mining approach is introduced that

generates discriminative patterns sequentially on a progressively shrinking FP-tree by

incrementally eliminating training instances. The instance elimination effectively re-

duces the problem size iteratively and expedites the mining process. Empirical results

show that DDPMine achieves orders of magnitude speedup without downgrading

classification accuracy and outperforms the state-of-the-art associative classification

methods in terms of both accuracy and efficiency.

6

Conclusions

We have presented a pattern-growth methodology for mining multiple kinds of fre-

quent patterns in large databases. Their associated performance studies show that

the algorithms derived from the pattern-growth methodology are more efficient and

scalable than many other frequent pattern mining methods.

According to our analysis, the high performance of the pattern-growth method-

ology is due to the following factors: (1) it adopts a divide-and-conquer strategy to

project and partition a large database recursively into a set of progressively smaller

ones, and the patterns to be searched for in each corresponding projected database are

also reduced substantially; (2) it integrates disk-based database projection algorithms

with main memory-based data structures and fast in-memory traversal algorithms,

which can be well-tuned to achieve combined high performance by swapping disk-

based algorithm into memory-based one when the projected and compressed data

set can fit in memory; and (3) it makes good use of the Apriori property implicitly as

well as other properties, such as the single tree-path property, but avoids generating

a large number of candidates, which ensures each counting and testing is on the

real data sets rather than on the potential candidate sets. These several techniques

combined lead to high performance mining algorithms.

Search WWH ::

Custom Search

Home