Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods - Data Mining: Concepts and Techniques - page 255

Databases Reference

In-Depth Information

H 2

bucket address

bucket count

bucket contents

0

2

{I1, I4}

{I3, I5}

1

2

{I1, I5}

{I1, I5}

2

4

{I2, I3}

{I2, I3}

{I2, I3}

{I2, I3}

3

2

{I2, I4}

{I2, I4}

4

2

{I2, I5}

{I2, I5}

5

4

{I1, I2}

{I1, I2}

{I1, I2}

{I1, I2}

6

4

{I1, I3}

{I1, I3}

{I1, I3}

{I1, I3}

Create hash table H 2

using hash function

h ( x , y )

(( order of x )

10

( order of y )) mod 7

Figure 6.5 Hash table, H 2 , for candidate 2-itemsets. This hash table was generated by scanning

Table 6.1's transactions while determining L 1 . If the minimum support count is, say, 3, then

the itemsets in buckets 0, 1, 3, and 4 cannot be frequent and so they should not be included

in C 2 .

Hash-based technique (hashing itemsets into corresponding buckets): A hash-based

technique can be used to reduce the size of the candidate k -itemsets, C k , for k

1.

For example, when scanning each transaction in the database to generate the frequent

1-itemsets, L 1 , we can generate all the 2-itemsets for each transaction, hash (i.e., map)

them into the different buckets of a hash table structure, and increase the correspond-

ing bucket counts (Figure 6.5). A 2-itemset with a corresponding bucket count in the

hash table that is below the support threshold cannot be frequent and thus should

be removed from the candidate set. Such a hash-based technique may substantially

reduce the number of candidate k -itemsets examined (especially when k D 2).

>

Transaction reduction (reducing the number of transactions scanned in future itera-

tions): A transaction that does not contain any frequent k -itemsets cannot contain any

frequent ( k C1)-itemsets. Therefore, such a transaction can be marked or removed

from further consideration because subsequent database scans for j -itemsets, where

j

>

k , will not need to consider such a transaction.

Partitioning (partitioning the data to find candidate itemsets): A partitioning tech-

nique can be used that requires just two database scans to mine the frequent itemsets

(Figure 6.6). It consists of two phases. In phase I, the algorithm divides the trans-

actions of D into n nonoverlapping partitions. If the minimum relative support

threshold for transactions in D is min sup , then the minimum support count for a

partition is min sup the number of transactions in that partition . For each partition,

all the local frequent itemsets (i.e., the itemsets frequent within the partition) are found.

A local frequent itemset may or may not be frequent with respect to the entire

database, D . However, any itemset that is potentially frequent with respect to D must

occur as a frequent itemset in at least one of the partitions . 8 Therefore, all local frequent

itemsets are candidate itemsets with respect to D . The collection of frequent itemsets

from all partitions forms the global candidate itemsets with respect to D . In phase II,

8 The proof of this property is left as an exercise (see Exercise 6.3d).

Next Page

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home