Frequent Pattern Mining Algorithms: A Survey - Frequent Pattern Mining

Database Reference

In-Depth Information

2.1

Apriori Method

The most basic join-based algorithm is the Apriori method [ 1 ]. The Apriori approach

uses a level-wise approach in which all frequent itemsets of length k are generated

before those of length ( k

1). The main observation which is used for the Apriori

algorithm is that every subset of a frequent pattern is also frequent. Therefore, can-

didates for frequent patterns of length ( k

1) can be generated from known frequent

patterns of length k with the use of joins. A join is defined by pairs of frequent k -

patterns that have at least ( k

−

1) items in common. Specifically, consider a frequent

pattern

that is frequent, but has not yet been discovered because only

itemsets of length 3 have been discovered so far. In this case, because the patterns

{

i 1 , i 2 , i 3 , i 4 }

i 1 , i 2 , i 3 }

and

{

i 1 , i 2 , i 4 }

are frequent, they will be present in the set

F 3 of all frequent

patterns with length k

2 items in

common. By performing a join on this pair, it is possible to create the candidate pat-

tern

3. Note that this particular pair also has k

−

. This pattern is referred to as a candidate because it might possibly

be frequent, and one most either rule it in or rule it out by support counting. There-

fore, this candidate is then validated against the transaction database by counting its

support. Clearly, the design of an efficient support counting method plays a critical

role in the overall efficiency of the process. Furthermore, it is important to note that

the same candidate can be produced by joining multiple frequent patterns. For ex-

ample, one might join

{

i 1 , i 2 , i 3 , i 4 }

to achieve the same result. Therefore,

in order to avoid duplication in candidate generation, two itemsets are joined only

whether first ( k −

{

i 1 , i 2 , i 3 }

and

{

i 2 , i 3 , i 4 }

1) items are the same, based on a lexicographic ordering imposed

on the items. This provides all the ( k +

1)-candidates in a non-redundant way.

It should be pointed out that some candidates can be pruned out in an efficient way,

without validating them against the transaction database. For any ( k

1)-candidates,

it is checked whether all its k subsets are frequent. Although it is already known that

two of its subsets contributing to the join are frequent, it is not known whether its

remaining subsets are frequent. If all its subsets are not frequent, then the candidate

can be pruned from consideration because of the downward closure property. This is

known as the Apriori pruning trick. For example, in the previous case, if the itemset

{

does not exist in the set of frequent 3-itemsets which have already been

found, then the candidate itemset

i 1 , i 3 , i 4 }

can be pruned from consideration with

no further computational effort. This greatly speeds up the overall algorithm. The

generation of 1-itemsets and 2-itemsets is usually performed in a specialized way

with more efficient techniques.

Therefore, the basic Apriori algorithm can be described recursively in level-wise

fashion. the overall algorithm comprises of three steps that are repeated over and

over again, for different values of k , where k is the length of the pattern generated in

the current iteration. The four steps are those of (i) generation of candidate patterns

C k + 1 by using joins on the patterns in

{

i 1 , i 2 , i 3 , i 4 }

F k , (ii) the pruning of candidates from

C k + 1 ,

F k , and (iii) the validation of the patterns in

C k + 1

for which all subsets to not lie in

against the transaction database

C k + 1 which is truly

frequent. The algorithm is terminated, when the set of frequent k -patterns

, to determine the subset of

F k in a

given iteration is empty. The pseudo-code of the overall procedure is presented in

Fig. 2.3 .

Frequent Pattern Mining

Search WWH ::

Custom Search

Home