Databases Reference
In-Depth Information
collecting those items that satisfy minimum support. The resulting set is denoted by L 1 .
Next, L 1 is used to find L 2 , the set of frequent 2-itemsets, which is used to find L 3 , and
so on, until no more frequent k -itemsets can be found. The finding of each L k requires
one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an item-
set I does not satisfy the minimum support threshold, min sup , then I is not frequent,
that is, P
min sup . If an item A is added to the itemset I , then the resulting itemset
(i.e., I [ A ) cannot occur more frequently than I . Therefore, I [ A is not frequent either,
that is, P
.
I
/<
min sup .
This property belongs to a special category of properties called antimonotonicity in
the sense that if a set cannot pass a test, all of its supersets will fail the same test as well . It
is called antimonotonicity because the property is monotonic in the context of failing a
test. 6
“How is the Apriori property used in the algorithm?” To understand this, let us look at
how L k 1 is used to find L k for k 2. A two-step process is followed, consisting of join
and prune actions.
.
I [ A
/<
1. The join step : To find L k , a set of candidate k -itemsets is generated by joining
L k 1 with itself. This set of candidates is denoted C k . Let l 1 and l 2 be itemsets
in L k 1 . The notation l i [ j ] refers to the j th item in l i (e.g., l 1 [ k 2] refers to
the second to the last item in l 1 ). For efficient implementation, Apriori assumes
that items within a transaction or itemset are sorted in lexicographic order. For
the
.
k 1
/
-itemset, l i , this means that the items are sorted such that l i [1]
<
l i [2]
<<
l i [ k 1]. The join, L k 1 1
L k 1 , is performed, where members of L k 1 are
joinable if their first
items are in common. That is, members l 1 and l 2
of L k 1 are joined if ( l 1 [1] D l 2 [1]
.
k 2
/
/^.
l 1 [2] D l 2 [2]
/^^.
l 1 [ k 2] D l 2 [ k 2]
/
^.
l 2 [ k 1] simply ensures that
no duplicates are generated. The resulting itemset formed by joining l 1 and l 2 is
f l 1 [1], l 1 [2],
l 1 [ k 1]
<
l 2 [ k 1]
/
. The condition l 1 [ k 1]
<
, l 1 [ k 2], l 1 [ k 1], l 2 [ k 1]g.
2. The prune step : C k is a superset of L k , that is, its members may or may not be
frequent, but all of the frequent k -itemsets are included in C k . A database scan to
determine the count of each candidate in C k would result in the determination of
L k (i.e., all candidates having a count no less than the minimum support count are
frequent by definition, and therefore belong to L k ). C k , however, can be huge, and so
this could involve heavy computation. To reduce the size of C k , the Apriori property
:::
6 The Apriori property has many applications. For example, it can also be used to prune search during
data cube computation (Chapter 5).
 
Search WWH ::




Custom Search