Databases Reference
In-Depth Information
customers' buying habits by searching for itemsets that are frequently purchased
together (or in sequence).
Association rule mining
consists of first finding
frequent itemsets
(sets of items,
such as
A
and
B
, satisfying a
minimum support threshold
, or percentage of the task-
relevant tuples), from which
strong
association rules in the form of
A
)
B
are
generated. These rules also satisfy a
minimum confidence threshold
(a prespecified
probability of satisfying
B
under the condition that
A
is satisfied). Associations can be
further analyzed to uncover
correlation rules
, which convey statistical correlations
between itemsets
A
and
B
.
Many efficient and scalable algorithms have been developed for
frequent itemset
mining
, from which association and correlation rules can be derived. These algo-
rithms can be classified into three categories: (1)
Apriori-like algorithms
, (2)
frequent
pattern growth
-
based algorithms
such as FP-growth, and (3)
algorithms that use the
vertical data format
.
The
Apriori algorithm
is a seminal algorithm for mining frequent itemsets for
Boolean association rules. It explores the level-wise mining Apriori property that
all
nonempty subsets of a frequent itemset must also be frequent.
At the
k
th iteration (for
k
2), it forms frequent
k
-itemset candidates based on the frequent
-itemsets,
and scans the database once to find the
complete
set of frequent
k
-itemsets,
L
k
.
Variations involving hashing and transaction reduction can be used to make the
procedure more efficient. Other variations include partitioning the data (mining on
each partition and then combining the results) and sampling the data (mining on
a data subset). These variations can reduce the number of data scans required to as
little as two or even one.
Frequent pattern growth
is a method of mining frequent itemsets without candidate
generation. It constructs a highly compact data structure (an
FP-tree
) to compress the
original transaction database. Rather than employing the generate-and-test strategy of
Apriori-like methods, it focuses on frequent pattern (fragment) growth, which avoids
costly candidate generation, resulting in greater efficiency.
Mining frequent itemsets using the vertical data format (Eclat)
is a method that
transforms a given data set of transactions in the horizontal data format of
TID-
itemset
into the vertical data format of
item-TID set
. It mines the transformed
data set by
TID set
intersections based on the Apriori property and additional
optimization techniques such as
diffset
.
Not all strong association rules are interesting. Therefore, the support-confidence
framework should be augmented with a pattern evaluation measure, which promotes
the mining of
interesting
rules. A measure is
null-invariant
if its value is free from
the influence of
null-transactions
(i.e., the
transactions that do not contain any of
the itemsets being examined
). Among many pattern evaluation measures, we exam-
ined
lift
,
.
k
1
/
2
, all confidence, max confidence, Kulczynski
, and
cosine
, and showed