Database Reference
In-Depth Information
each transaction t is a subset of I . The pattern language
consists of itemsets, i.e.,
again sets of items. The support count of an itemset X in D is defined as the number
of transactions that contain X , i.e., supp D ( X )
L
=|{ t D | X t }|
. We write fr D ( X )
to denote the relative support of X in D , i.e., fr D ( X )
=
supp D ( X ) /
|
D
|
.Wedonot
write D wherever clear from context.
The 'interestingness' predicate is a threshold on the support of the itemsets, the
minimal support: minsup . In other words, the task in frequent set mining is to
compute
{
X
L |
supp D ( X )
minsup
}
The itemsets in the result are called frequent itemsets.
Intuition The intuition behind this measure is simple: the more often an itemset
occurs in the data, the more interesting it is.
Frequent itemset mining was originally not a goal on itself, but merely a necessary
step in order to mine association rules [ 3 ]. There, the task is to discover rules X
Y ,
where X and Y are itemsets with X
Y
=∅
, such that when itemset X is a subset
ofarow t
t . Such
rules express associations, possibly correlations, and can hence be useful in many
applications. A main motivation was supermarket basket analysis, the idea being that
by advertising X , people will also buy more of Y .
The basic strategy for mining association rules is to first mine frequent patterns,
and then consider all partitionings of each frequent itemset Z into non-overlapping
subsets X and Y , to form candidate rules X Y , while finally keeping only
those association rules that satisfy some quality threshold [ 3 ]. Though an interesting
research topic on itself, interestingness measures for association rules are beyond
the scope of this chapter. We refer the interested reader to the recent survey by Tew
et al. [ 69 ].
A Priori With a search space of 2 | I | patterns, the naive approach of evaluating
every pattern is infeasible. However, in 1994 it was discovered that support exhibits
monotonicity. That is, for two itemsets X and Y , we know
D , X
t , with high confidence we will also see itemset Y
X
Y
supp ( X )
supp ( Y ),
which is known as the A Priori property [ 1 , 44 ], and allows for efficient search for
frequent itemsets over the lattice of all itemsets.
The A Priori algorithm was independently discovered by Agrawal and Srikant
[ 1 ], and by Mannila, Toivonen, and Verkamo [ 44 ]. It is a so-called candidate test
framework. Given a transaction database D over a set of items
and a support
threshold minsup, it first determines the set of singleton frequent itemsets
I
F 1
=
{
i
I
|
supp ( i )
minsup
}
. Then, given a set
F k of frequent patterns of length
C k + 1 of candidate frequent patterns of length k
+
k , we can construct the set
1, by
F k .We
then determine the supports of all candidates in one pass over the data, and obtain
F k + 1 by keeping only the candidates with supp ( X )
considering only itemsets that have all k sub-itemsets of length k included in
minsup .
Search WWH ::




Custom Search