• Because each newly created transaction in T_k is much shorter, this makes subsequent support counting more efficient.
• In some cases, no candidate may be a subset of the transaction. Such a transaction can be dropped from the database because it does not contribute to the counting of support values.
• In other cases, more than one candidate may be a subset of the transaction, which will actually increase the overhead of the algorithm. Clearly, this is not a desirable scenario (see the sketch after this list).
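To make this concrete, the following is a minimal Python sketch of the transformed representation described by these points (not the original implementation; the function name and data layout are illustrative assumptions): each transaction is replaced by the list of candidate itemsets it contains, and a transaction that contains no candidate is dropped.

    # Sketch: rewrite each transaction as the list of candidate itemsets it
    # contains; transactions that contain no candidate are dropped.
    def transform_database(transactions, candidates):
        """transactions: list of item sets; candidates: list of frozensets."""
        transformed = []
        for t in transactions:
            contained = [c for c in candidates if c <= t]  # candidates that are subsets of t
            if contained:  # an empty list means the transaction can be dropped
                transformed.append(contained)
        return transformed

    # With candidates {a,b} and {b,c}, the transaction {a,b,c} matches both
    # (the overhead case), while {a,d} matches neither and is dropped.
    db = [{"a", "b", "c"}, {"a", "d"}]
    cands = [frozenset({"a", "b"}), frozenset({"b", "c"})]
    print(transform_database(db, cands))  # [[frozenset({'a', 'b'}), frozenset({'b', 'c'})]]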
Thus, the first two factors improve the efficiency of the new representation, whereas the last factor worsens it. Typically, the impact of the last factor is greater in the early iterations, whereas the impact of the first two factors is greater in the later iterations. Therefore, to maximize overall efficiency, a natural approach is to skip this optimization in the early iterations and apply it only in the later ones. This variation is referred to as the AprioriHybrid algorithm [1]. Another optimization, proposed in [9], is that the support of many patterns can be inferred from those of key patterns in the data; this is used to significantly enhance the efficiency of the approach.
Numerous other techniques have been proposed to optimize the original implementation of the Apriori algorithm. For example, the methods in [1] and [44] share a number of similarities but differ at the implementation level. A work that combines ideas from these different pieces of work is presented in [3].
2.2 DHP Algorithm
The DHP algorithm, also known as the Direct Hashing and Pruning method [50], was proposed soon after the Apriori method. It proposes two main optimizations to speed up the algorithm: the first is to prune the candidate itemsets in each iteration, and the second is to trim the transactions to make the support-counting process more efficient.
To prune the itemsets, the algorithm tracks partial information about candidate (k + 1)-itemsets while explicitly counting the support of candidate k-itemsets. During the counting of candidate k-itemsets, all (k + 1)-subsets of each transaction are found and hashed into a table that maintains the counts of the number of subsets hashed into each entry. During the phase of counting (k + 1)-itemsets, the count in the hash table is retrieved for each candidate itemset. Clearly, these counts are overestimates because of possible collisions in the hash table. Those itemsets whose counts are below the user-specified support level are then pruned from consideration.
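The following is a minimal Python sketch of this hashing scheme (the table size and hash function are illustrative assumptions, not the choices made in [50]; minsup is an absolute count): while counting candidate k-itemsets, every (k + 1)-subset of each transaction is hashed into a fixed-size table of counters, and a candidate (k + 1)-itemset is retained only if its bucket count reaches minsup.

    from itertools import combinations

    TABLE_SIZE = 1024  # illustrative; the real table size is a design parameter

    def bucket(itemset):
        # Any order-independent hash works; hashing a frozenset is one simple choice.
        return hash(frozenset(itemset)) % TABLE_SIZE

    def build_hash_table(transactions, k):
        """While counting k-itemsets, hash all (k+1)-subsets of each transaction."""
        table = [0] * TABLE_SIZE
        for t in transactions:
            for subset in combinations(sorted(t), k + 1):
                table[bucket(subset)] += 1
        return table

    def prune_candidates(candidates, table, minsup):
        """Bucket counts overestimate true supports (collisions only add counts),
        so pruning buckets below minsup never discards a frequent itemset."""
        return [c for c in candidates if table[bucket(c)] >= minsup]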
A second optimization proposed in DHP is that of transaction trimming. A key observation here is that if an item does not appear in at least k frequent itemsets in F_k, then no frequent itemset in F_{k+1} will contain that item. This follows from the fact that each frequent pattern in F_{k+1} that contains the item has at least k (immediate) k-subsets that also contain the item, and each of these subsets must itself be frequent. Such items can therefore be trimmed from the transactions, which shortens them for subsequent support counting.
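A minimal Python sketch of this trimming rule follows (the helper names are my own): count how many itemsets of F_k each item occurs in, and keep only items that occur in at least k of them.

    from collections import Counter

    def trim_transactions(transactions, frequent_k_itemsets, k):
        """Remove items that occur in fewer than k itemsets of F_k; such items
        cannot appear in any frequent (k+1)-itemset."""
        occurrences = Counter(item for itemset in frequent_k_itemsets
                              for item in itemset)
        keep = {item for item, count in occurrences.items() if count >= k}
        trimmed = [t & keep for t in transactions]
        return [t for t in trimmed if t]  # transactions emptied by trimming are dropped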