Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

customers' buying habits by searching for itemsets that are frequently purchased

together (or in sequence).

Association rule mining consists of first finding frequent itemsets (sets of items,

such as A and B , satisfying a minimum support threshold , or percentage of the task-

relevant tuples), from which strong association rules in the form of A ) B are

generated. These rules also satisfy a minimum confidence threshold (a prespecified

probability of satisfying B under the condition that A is satisfied). Associations can be

further analyzed to uncover correlation rules , which convey statistical correlations

between itemsets A and B .

Many efficient and scalable algorithms have been developed for frequent itemset

mining , from which association and correlation rules can be derived. These algo-

rithms can be classified into three categories: (1) Apriori-like algorithms , (2) frequent

pattern growth - based algorithms such as FP-growth, and (3) algorithms that use the

vertical data format .

The Apriori algorithm is a seminal algorithm for mining frequent itemsets for

Boolean association rules. It explores the level-wise mining Apriori property that all

nonempty subsets of a frequent itemset must also be frequent. At the k th iteration (for

k 2), it forms frequent k -itemset candidates based on the frequent

-itemsets,

and scans the database once to find the complete set of frequent k -itemsets, L k .

Variations involving hashing and transaction reduction can be used to make the

procedure more efficient. Other variations include partitioning the data (mining on

each partition and then combining the results) and sampling the data (mining on

a data subset). These variations can reduce the number of data scans required to as

little as two or even one.

Frequent pattern growth is a method of mining frequent itemsets without candidate

generation. It constructs a highly compact data structure (an FP-tree ) to compress the

original transaction database. Rather than employing the generate-and-test strategy of

Apriori-like methods, it focuses on frequent pattern (fragment) growth, which avoids

costly candidate generation, resulting in greater efficiency.

Mining frequent itemsets using the vertical data format (Eclat) is a method that

transforms a given data set of transactions in the horizontal data format of TID-

itemset into the vertical data format of item-TID set . It mines the transformed

data set by TID set intersections based on the Apriori property and additional

optimization techniques such as diffset .

Not all strong association rules are interesting. Therefore, the support-confidence

framework should be augmented with a pattern evaluation measure, which promotes

the mining of interesting rules. A measure is null-invariant if its value is free from

the influence of null-transactions (i.e., the transactions that do not contain any of

the itemsets being examined ). Among many pattern evaluation measures, we exam-

ined lift ,

.

k 1

/

2 , all confidence, max confidence, Kulczynski , and cosine , and showed

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home