Databases Reference
In-Depth Information
6.1.2 Frequent Itemsets, Closed Itemsets,
and Association Rules
LetIDf I 1 , I 2 ,
, I m g be an itemset. Let D , the task-relevant data, be a set of database
transactions where each transaction T is a nonempty itemset such that T I. Each
transaction is associated with an identifier, called a TID . Let A be a set of items. A trans-
action T is said to contain A if A T . An association rule is an implication of the form
A ) B , where A I, B I, A 6D;, B 6D;, and A \ B D
:::
. The rule A ) B holds in the
transaction set D with support s , where s is the percentage of transactions in D that
contain A [ B (i.e., the union of sets A and B say, or, both A and B ). This is taken to be
the probability, P
. 1 The rule A ) B has confidence c in the transaction set D ,
where c is the percentage of transactions in D containing A that also contain B . This is
taken to be the conditional probability, P
.
A [ B
/
.
B j A
/
. That is,
support
.
A ) B
/D P
.
A [ B
/
(6.2)
confidence
.
A ) B
/D P
.
B j A
/
.
(6.3)
Rules that satisfy both a minimum support threshold ( min sup ) and a minimum con-
fidence threshold ( min conf ) are called strong . By convention, we write support and
confidence values so as to occur between 0% and 100%, rather than 0 to 1.0.
A set of items is referred to as an itemset . 2 An itemset that contains k items is a
k -itemset . The set f computer, antivirus software g is a 2-itemset. The occurrence fre-
quency of an itemset is the number of transactions that contain the itemset. This is
also known, simply, as the frequency , support count , or count of the itemset. Note
that the itemset support defined in Eq. (6.2) is sometimes referred to as relative support ,
whereas the occurrence frequency is called the absolute support . If the relative support
of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute
support of I satisfies the corresponding minimum support count threshold ), then I is
a frequent itemset. 3 The set of frequent k -itemsets is commonly denoted by L k . 4
From Eq. (6.3), we have
support
.
A [ B
/
support count
.
A [ B
/
confidence
.
A ) B
/D P
.
B j A
/D
D
.
(6.4)
support
.
A
/
support count
.
A
/
1 Notice that the notation P
indicates the probability that a transaction contains the union of sets
A and B (i.e., it contains every item in A and B ). This should not be confused with P
.
A [ B
/
.
A or B
/
, which
indicates the probability that a transaction contains either A or B .
2 In the data mining research literature, “itemset” is more commonly used than “item set.”
3 In early work, itemsets satisfying minimum support were referred to as large . This term, however,
is somewhat confusing as it has connotations of the number of items in an itemset rather than the
frequency of occurrence of the set. Hence, we use the more recent term frequent .
4 Although the term frequent is preferred over large , for historic reasons frequent k -itemsets are still
denoted as L k .
 
Search WWH ::




Custom Search