passes comes at an increased computational cost. Therefore, it was proposed in [1]
that the approach should be used only for later passes, when the number of candidates
has already been reduced significantly. This reduces the likelihood of the number of
candidates blowing up with this approach.
7.2 Sampling Tricks
A number of sampling tricks can be used to greatly improve the efficiency of the
frequent pattern mining process. Most sampling methods require two passes over
the data, the first of which is used for sampling. An interesting two-pass sampling
approach is discussed in [65]. This method generates approximately frequent patterns
from a sample of the data. False negatives can be reduced by lowering the minimum
support level appropriately, so that probabilistic bounds can be placed on the
likelihood of missing a true frequent pattern. False positives are then removed with
a second pass over the data. The major drawback of the approach is that the reduction
in the minimum support level needed to control false negatives can be significant,
which in turn reduces the computational efficiency of the approach. On the other
hand, the method requires only two passes over the data: the first pass is used to
create the sample, and the second pass is used to remove the false positives.
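The following minimal sketch illustrates this two-pass idea in Python. All names
and parameters (sample_frac, the slack factor on the support level) are illustrative
assumptions, and the brute-force level-wise miner stands in for whatever frequent
pattern mining algorithm is applied to the sample:

import random
from collections import Counter

def mine_frequent(transactions, minsup):
    # Brute-force level-wise (Apriori-style) miner; returns all itemsets
    # whose relative support is at least minsup.
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = set()
    k = 1
    while candidates:
        counts = Counter()
        for t in transactions:
            tset = set(t)
            for cand in candidates:
                if cand.issubset(tset):
                    counts[cand] += 1
        level = {c for c in candidates if counts[c] / n >= minsup}
        frequent |= level
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
    return frequent

def sample_and_verify(transactions, minsup, sample_frac=0.1, slack=0.8):
    # Pass 1: mine a random sample at a lowered support (minsup * slack),
    # which reduces the chance of false negatives at the cost of extra
    # candidates. The slack factor here is an illustrative choice.
    size = max(1, int(sample_frac * len(transactions)))
    sample = random.sample(transactions, size)
    candidates = mine_frequent(sample, minsup * slack)
    # Pass 2: count the candidates exactly over the full database and
    # discard the false positives.
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        tset = set(t)
        for cand in candidates:
            if cand.issubset(tset):
                counts[cand] += 1
    return {c for c in candidates if counts[c] / n >= minsup}

Even with the lowered support level, false negatives remain possible; the analysis
in [65] chooses the reduction so that their probability is bounded.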
An interesting approach proposed in [57] divides the disk-resident database into
smaller memory-resident partitions. For each partition, more efficient algorithms
can be used, because of the memory-resident nature of the partition. The key
observation is that any itemset that is frequent over the entire database must be
frequent in at least one partition. Therefore, the union of the locally frequent
itemsets over the different partitions provides a superset of the true frequent
patterns. A postprocessing phase is then used to filter out the spurious itemsets,
by counting this candidate set against the transaction database. As long as the
partitions are reasonably large, the superset found approximates the true frequent
patterns very well, and therefore the additional time spent counting irrelevant
candidates is relatively small. The main advantage of this approach is that it
requires only two passes over the database. Therefore, such an approach is
particularly effective when the data is resident on disk.
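A minimal sketch of this two-phase strategy is shown below, assuming an in-memory
list of transactions stands in for the disk-resident database and reusing the
mine_frequent helper from the previous sketch as the per-partition mining
algorithm; the partition count is an illustrative parameter:

def partition_mine(transactions, minsup, num_partitions=4):
    n = len(transactions)
    size = (n + num_partitions - 1) // num_partitions
    # Phase 1: mine each memory-resident partition at the same relative
    # support level. Any itemset frequent over the full database must be
    # locally frequent in at least one partition, so the union of the
    # local results is a superset of the true frequent patterns.
    candidates = set()
    for start in range(0, n, size):
        candidates |= mine_frequent(transactions[start:start + size], minsup)
    # Phase 2: a single counting pass over the full database removes the
    # spurious (locally but not globally frequent) candidates.
    counts = Counter()
    for t in transactions:
        tset = set(t)
        for cand in candidates:
            if cand.issubset(tset):
                counts[cand] += 1
    return {c for c in candidates if counts[c] / n >= minsup}

Because phase 1 touches each partition once and phase 2 is a single counting scan,
the procedure reads the database exactly twice, which is what makes it attractive
for disk-resident data.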
The Dynamic Itemset Counting (DIC) algorithm [15] divides the database into
intervals, and begins counting a longer candidate as soon as its subsets are
already known to be frequent. These candidates are then validated over the
database. Such an approach can reduce the number of passes over the data, because
it implicitly combines the processes of candidate generation and counting.
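The following simplified sketch conveys the flavor of this interleaving, assuming
the database fits in a list; the interval size, the cyclic scanning, and the
suspected/confirmed bookkeeping are illustrative simplifications of the full
algorithm in [15]:

from itertools import combinations

def dic(transactions, minsup_count, interval=100):
    # Simplified Dynamic Itemset Counting: scan the database cyclically in
    # fixed-size intervals, and start counting a longer candidate as soon
    # as all of its subsets are suspected or confirmed to be frequent,
    # instead of waiting for the end of a full pass.
    n = len(transactions)
    # Each active candidate tracks [support_count, transactions_seen].
    counting = {frozenset([item]): [0, 0] for t in transactions for item in t}
    suspected = set()               # passed the threshold, count unfinished
    frequent, infrequent = set(), set()
    pos = 0
    while counting:
        block = [set(transactions[(pos + j) % n]) for j in range(interval)]
        for cand, state in list(counting.items()):
            for t in block:
                if state[1] >= n:   # already counted over the full database
                    break
                state[1] += 1
                if cand <= t:
                    state[0] += 1
            if state[0] >= minsup_count:
                suspected.add(cand)
            if state[1] >= n:       # finalize after one complete cyclic pass
                del counting[cand]
                suspected.discard(cand)
                (frequent if state[0] >= minsup_count else infrequent).add(cand)
        # Candidate generation is interleaved with counting: start counting
        # any extension whose subsets are all suspected or confirmed frequent.
        known = frequent | suspected
        for a in known:
            for b in known:
                cand = a | b
                if (len(cand) == len(a) + 1
                        and cand not in counting and cand not in frequent
                        and cand not in infrequent
                        and all(frozenset(s) in known
                                for s in combinations(cand, len(cand) - 1))):
                    counting[cand] = [0, 0]
        pos = (pos + interval) % n
    return frequent

With an interval much smaller than the database, candidates of different lengths
are counted concurrently, so most itemsets finish their counts within far fewer
scans than a strict level-wise algorithm would require.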