Data Mining of Association Rules and the Process of Knowledge Discovery in Databases - Advances in Data Mining

Information Technology Reference

In-Depth Information

1)

F 1 = { frequent 1-itemsets }

2)

for(level = 2; F level-1 = ∅ ; ++level)

3)

{

4)

C level = generate candidates( F level − 1 );

5)

forall transactions T ∈D

6)

{

7)

forall C ⊆ T

8)

{

9)

if C ∈C level then

10)

++ C. count

11)

}

12)

}

13)

}

Fig. 3. Pseudocode of the Apriori algorithm [3]

repeated until no more candidates to count are generated. In other words, Apri-

ori starts with determining all frequent 2-itemsets, proceeds with all frequent

3-itemsets etc and stops as soon as it has reached a level that is completely

belowthe border from Figure 2.

Most other algorithms are variations of the Apriori principle. For instance

Partition [26] combines the breadth-first search of Apriori with determining the

support values of the candidates indirectly by set intersections. In order to be

able to keep all necessary transaction sets comfortably in main memory the

databasetypicallyneedstobepartitioned.ThealgorithmDIC[6]enhancesApri-

ori by relaxing the strict separation between candidate generation and counting

the candidates. Already during passing over the transactions newcandidates are

generated and added to the set of candidates on the fly. This helps to signifi-

cantly reduce the total number of the expensive passes over the database. Eclat

[33] combines a depth-first search strategy with intersections of transaction sets.

Due to depth-first search it is no longer guaranteed that all necessary infor-

mation for pruning candidates based on infrequent subsets is always available.

Accordingly Eclat has to live with larger candidate sets. Hybrid [15] employs a

special depth-first search strategy that does not have this limitation and more-

over profits from switching between breadth-first search and depth-first search

during lattice traversal.

Figure 4 shows a performance benchmark covering several state of the

art mining algorithms. The experiments were run on a PentiumIII clocked at

500Mhz. The algorithms were implemented in C++ and have proven their ef-

ficiency in several projects, c.f. [16,17,31]. On the logarithmically scaled x-axis

the threshold minsupp is lowered from 2% down to 0 . 125%. On the also log-

arithmically scaled y-axis we find the corresponding time for frequent pattern

generation. The employed dataset is a well known benchmark dataset first intro-

duced in [3]. Its average transaction size is 10, average frequent pattern size 4,

and it contains a total number of 1 million transactions. Obviously for none of

Advances in Data Mining

Search WWH ::

Custom Search

Home