Information Technology Reference
In-Depth Information
1)
F 1 = { frequent 1-itemsets }
2)
for(level = 2; F level-1 = ; ++level)
3)
{
4)
C level = generate candidates( F level 1 );
5)
forall transactions T ∈D
6)
{
7)
forall C ⊆ T
8)
{
9)
if C ∈C level then
10)
++ C. count
11)
}
12)
}
13)
}
Fig. 3. Pseudocode of the Apriori algorithm [3]
repeated until no more candidates to count are generated. In other words, Apri-
ori starts with determining all frequent 2-itemsets, proceeds with all frequent
3-itemsets etc and stops as soon as it has reached a level that is completely
belowthe border from Figure 2.
Most other algorithms are variations of the Apriori principle. For instance
Partition [26] combines the breadth-first search of Apriori with determining the
support values of the candidates indirectly by set intersections. In order to be
able to keep all necessary transaction sets comfortably in main memory the
databasetypicallyneedstobepartitioned.ThealgorithmDIC[6]enhancesApri-
ori by relaxing the strict separation between candidate generation and counting
the candidates. Already during passing over the transactions newcandidates are
generated and added to the set of candidates on the fly. This helps to signifi-
cantly reduce the total number of the expensive passes over the database. Eclat
[33] combines a depth-first search strategy with intersections of transaction sets.
Due to depth-first search it is no longer guaranteed that all necessary infor-
mation for pruning candidates based on infrequent subsets is always available.
Accordingly Eclat has to live with larger candidate sets. Hybrid [15] employs a
special depth-first search strategy that does not have this limitation and more-
over profits from switching between breadth-first search and depth-first search
during lattice traversal.
Figure 4 shows a performance benchmark covering several state of the
art mining algorithms. The experiments were run on a PentiumIII clocked at
500Mhz. The algorithms were implemented in C++ and have proven their ef-
ficiency in several projects, c.f. [16,17,31]. On the logarithmically scaled x-axis
the threshold minsupp is lowered from 2% down to 0 . 125%. On the also log-
arithmically scaled y-axis we find the corresponding time for frequent pattern
generation. The employed dataset is a well known benchmark dataset first intro-
duced in [3]. Its average transaction size is 10, average frequent pattern size 4,
and it contains a total number of 1 million transactions. Obviously for none of
Search WWH ::




Custom Search