Database Reference
In-Depth Information
abcd (1)
frequent
abc (2)
abd (3)
acd (1)
bcd (1)
abcd
ndi
1111
1110
1101
1000
1101
0010
ab (4)
ac (2)
bc (2)
ad (3)
bd (3)
cd (1)
free
a (5)
b (4)
c (2)
d (3)
(6)
a
b
dataset
itemset lattice
Fig. 5.1 A dataset of 4 items and 6 transactions and the corresponding lattice. The lattice shows the
free, non-derivable, and, for minsup = 2, the frequent itemsets. Closed itemsets are highlighted
As an example, consider Fig. 5.1 , where we depict a toy dataset and the lattice of
all itemsets. Say we aim to mine all frequent itemsets with a minimal support of 2,
i.e., a minimum frequency of 2 / 6
1 / 3. The A Priori algorithm considers the lattice
level-wise, and first identifies the frequent singleton itemsets. Here, a , b , c , and d are
all frequent. It then constructs the candidate set by taking the Cartesian product with
the frequent singletons. In this example this is the full set of itemsets of cardinality 2,
i.e.,
=
. We calculate the support of all candidates, and find
that all itemsets, except cd , are frequent, i.e.,
C 2 ={
ab , ac , bc , ad , bd , cd
}
F 2 ={
ab , ac , bc , ad , bd
}
. Iterating to
the third level, we have
F 2 contain cd ,of
which we know it is not frequent, and hence neither will any larger itemset containing
it. We find that the two remaining candidates are frequent,
C 3 ={
abc , abd
}
, as all other extensions of
C 4 =∅
as there are no itemsets of size 4 that have all of their sub-itemsets of length 3 in
F 3 = C 3 . Finally,
F 3 .
Hence, the answer to the stated problem, the complete set of frequent itemsets for
minsup
.
The A Priori, or, perhaps more aptly named, level-wise algorithm can be applied
for any enumerable pattern language
=
2, is hence
F ={
a , b , c , ab , ac , bc , ad , bd , abc , abd
}
and monotonic interestingness measure q .
Soon after the discovery of the A Priori property, we see three major focal points
for further research. In particular, a lot of attention was given to investigating more
efficient algorithms for mining frequent itemsets [ 24 ] (see also Chaps. 2 and 3),
methods that can mine frequent patterns from data types other than binary (see
Chap. 11), and third, on further measures of interestingness (this chapter).
L
Pattern Explosion Now armed with the ability to mine frequent itemsets in prac-
tice, researchers quickly found that frequency is not quite the ideal interestingness
measure. That is, we find that for high support thresholds only find patterns represent-
ing common knowledge are discovered. However, when we lower the threshold, we
are typically quickly flooded with such enormous amounts of results that it becomes
impossible to inspect or use them. Moreover, the result set is highly redundant: very
many of the returned patterns are simply variations of each other. Combined, this
 
Search WWH ::




Custom Search