Databases Reference
In-Depth Information
6.1.2
Frequent Itemsets, Closed Itemsets,
and Association Rules
LetIDf
I
1
,
I
2
,
,
I
m
g be an itemset. Let
D
, the task-relevant data, be a set of database
transactions where each transaction
T
is a nonempty itemset such that
T
I. Each
transaction is associated with an identifier, called a
TID
. Let
A
be a set of items. A trans-
action
T
is said to contain
A
if
A
T
. An association rule is an implication of the form
A
)
B
, where
A
I,
B
I,
A
6D;,
B
6D;, and
A
\
B
D
:::
. The rule
A
)
B
holds in the
transaction set
D
with
support
s
, where
s
is the percentage of transactions in
D
that
contain
A
[
B
(i.e., the
union
of sets
A
and
B
say, or, both
A
and
B
). This is taken to be
the probability,
P
.
1
The rule
A
)
B
has
confidence
c
in the transaction set
D
,
where
c
is the percentage of transactions in
D
containing
A
that also contain
B
. This is
taken to be the conditional probability,
P
.
A
[
B
/
.
B
j
A
/
. That is,
support
.
A
)
B
/D
P
.
A
[
B
/
(6.2)
confidence
.
A
)
B
/D
P
.
B
j
A
/
.
(6.3)
Rules that satisfy both a minimum support threshold (
min sup
) and a minimum con-
fidence threshold (
min
conf
) are called
strong
. By convention, we write support and
confidence values so as to occur between 0% and 100%, rather than 0 to 1.0.
A set of items is referred to as an
itemset
.
2
An itemset that contains
k
items is a
k
-itemset
. The set f
computer, antivirus software
g is a 2-itemset. The
occurrence fre-
quency of an itemset
is the number of transactions that contain the itemset. This is
also known, simply, as the
frequency
,
support count
, or
count
of the itemset. Note
that the itemset support defined in Eq. (6.2) is sometimes referred to as
relative support
,
whereas the occurrence frequency is called the
absolute support
. If the relative support
of an itemset
I
satisfies a prespecified
minimum support threshold
(i.e., the absolute
support of
I
satisfies the corresponding
minimum support count threshold
), then
I
is
a
frequent
itemset.
3
The set of frequent
k
-itemsets is commonly denoted by
L
k
.
4
From Eq. (6.3), we have
support
.
A
[
B
/
support count
.
A
[
B
/
confidence
.
A
)
B
/D
P
.
B
j
A
/D
D
.
(6.4)
support
.
A
/
support count
.
A
/
1
Notice that the notation
P
indicates the probability that a transaction contains the
union
of sets
A
and
B
(i.e., it contains every item in
A
and
B
). This should not be confused with
P
.
A
[
B
/
.
A or B
/
, which
indicates the probability that a transaction contains either
A
or
B
.
2
In the data mining research literature, “itemset” is more commonly used than “item set.”
3
In early work, itemsets satisfying minimum support were referred to as
large
. This term, however,
is somewhat confusing as it has connotations of the number of items in an itemset rather than the
frequency of occurrence of the set. Hence, we use the more recent term
frequent
.
4
Although the term
frequent
is preferred over
large
, for historic reasons frequent
k
-itemsets are still
denoted as
L
k
.