Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

collecting those items that satisfy minimum support. The resulting set is denoted by L 1 .

Next, L 1 is used to find L 2 , the set of frequent 2-itemsets, which is used to find L 3 , and

so on, until no more frequent k -itemsets can be found. The finding of each L k requires

one full scan of the database.

To improve the efficiency of the level-wise generation of frequent itemsets, an

important property called the Apriori property is used to reduce the search space.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

The Apriori property is based on the following observation. By definition, if an item-

set I does not satisfy the minimum support threshold, min sup , then I is not frequent,

that is, P

min sup . If an item A is added to the itemset I , then the resulting itemset

(i.e., I [ A ) cannot occur more frequently than I . Therefore, I [ A is not frequent either,

that is, P

min sup .

This property belongs to a special category of properties called antimonotonicity in

the sense that if a set cannot pass a test, all of its supersets will fail the same test as well . It

is called antimonotonicity because the property is monotonic in the context of failing a

test. 6

“How is the Apriori property used in the algorithm?” To understand this, let us look at

how L k 1 is used to find L k for k 2. A two-step process is followed, consisting of join

and prune actions.

I [ A

1. The join step : To find L k , a set of candidate k -itemsets is generated by joining

L k 1 with itself. This set of candidates is denoted C k . Let l 1 and l 2 be itemsets

in L k 1 . The notation l i [ j ] refers to the j th item in l i (e.g., l 1 [ k 2] refers to

the second to the last item in l 1 ). For efficient implementation, Apriori assumes

that items within a transaction or itemset are sorted in lexicographic order. For

the

k 1

-itemset, l i , this means that the items are sorted such that l i [1]

l i [2]

l i [ k 1]. The join, L k 1 1

L k 1 , is performed, where members of L k 1 are

joinable if their first

items are in common. That is, members l 1 and l 2

of L k 1 are joined if ( l 1 [1] D l 2 [1]

k 2

/^.

l 1 [2] D l 2 [2]

/^^.

l 1 [ k 2] D l 2 [ k 2]

l 2 [ k 1] simply ensures that

no duplicates are generated. The resulting itemset formed by joining l 1 and l 2 is

f l 1 [1], l 1 [2],

l 1 [ k 1]

l 2 [ k 1]

. The condition l 1 [ k 1]

, l 1 [ k 2], l 1 [ k 1], l 2 [ k 1]g.

2. The prune step : C k is a superset of L k , that is, its members may or may not be

frequent, but all of the frequent k -itemsets are included in C k . A database scan to

determine the count of each candidate in C k would result in the determination of

L k (i.e., all candidates having a count no less than the minimum support count are

frequent by definition, and therefore belong to L k ). C k , however, can be huge, and so

this could involve heavy computation. To reduce the size of C k , the Apriori property

:::

6 The Apriori property has many applications. For example, it can also be used to prune search during

data cube computation (Chapter 5).

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home