Databases Reference
In-Depth Information
collecting those items that satisfy minimum support. The resulting set is denoted by
L
1
.
Next,
L
1
is used to find
L
2
, the set of frequent 2-itemsets, which is used to find
L
3
, and
so on, until no more frequent
k
-itemsets can be found. The finding of each
L
k
requires
one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the
Apriori property
is used to reduce the search space.
Apriori property:
All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an item-
set
I
does not satisfy the minimum support threshold,
min sup
, then
I
is not frequent,
that is,
P
min sup
. If an item
A
is added to the itemset
I
, then the resulting itemset
(i.e.,
I
[
A
) cannot occur more frequently than
I
. Therefore,
I
[
A
is not frequent either,
that is,
P
.
I
/<
min sup
.
This property belongs to a special category of properties called
antimonotonicity
in
the sense that
if a set cannot pass a test, all of its supersets will fail the same test as well
. It
is called
antimonotonicity
because the property is monotonic in the context of failing a
test.
6
“How is the Apriori property used in the algorithm?”
To understand this, let us look at
how
L
k
1
is used to find
L
k
for
k
2. A two-step process is followed, consisting of
join
and
prune
actions.
.
I
[
A
/<
1.
The join step
: To find
L
k
, a set of
candidate
k
-itemsets is generated by joining
L
k
1
with itself. This set of candidates is denoted
C
k
. Let
l
1
and
l
2
be itemsets
in
L
k
1
. The notation
l
i
[
j
] refers to the
j
th item in
l
i
(e.g.,
l
1
[
k
2] refers to
the second to the last item in
l
1
). For efficient implementation, Apriori assumes
that items within a transaction or itemset are sorted in lexicographic order. For
the
.
k
1
/
-itemset,
l
i
, this means that the items are sorted such that
l
i
[1]
<
l
i
[2]
<<
l
i
[
k
1]. The join,
L
k
1
1
L
k
1
, is performed, where members of
L
k
1
are
joinable if their first
items are in common. That is, members
l
1
and
l
2
of
L
k
1
are joined if (
l
1
[1] D
l
2
[1]
.
k
2
/
/^.
l
1
[2] D
l
2
[2]
/^^.
l
1
[
k
2] D
l
2
[
k
2]
/
^.
l
2
[
k
1] simply ensures that
no duplicates are generated. The resulting itemset formed by joining
l
1
and
l
2
is
f
l
1
[1],
l
1
[2],
l
1
[
k
1]
<
l
2
[
k
1]
/
. The condition
l
1
[
k
1]
<
,
l
1
[
k
2],
l
1
[
k
1],
l
2
[
k
1]g.
2.
The prune step
:
C
k
is a superset of
L
k
, that is, its members may or may not be
frequent, but all of the frequent
k
-itemsets are included in
C
k
. A database scan to
determine the count of each candidate in
C
k
would result in the determination of
L
k
(i.e., all candidates having a count no less than the minimum support count are
frequent by definition, and therefore belong to
L
k
).
C
k
, however, can be huge, and so
this could involve heavy computation. To reduce the size of
C
k
, the Apriori property
:::
6
The Apriori property has many applications. For example, it can also be used to prune search during
data cube computation (Chapter 5).