Databases Reference
In-Depth Information
H 2
bucket address
bucket count
bucket contents
0
2
{I1, I4}
{I3, I5}
1
2
{I1, I5}
{I1, I5}
2
4
{I2, I3}
{I2, I3}
{I2, I3}
{I2, I3}
3
2
{I2, I4}
{I2, I4}
4
2
{I2, I5}
{I2, I5}
5
4
{I1, I2}
{I1, I2}
{I1, I2}
{I1, I2}
6
4
{I1, I3}
{I1, I3}
{I1, I3}
{I1, I3}
Create hash table H 2
using hash function
h ( x , y )
(( order of x )
10
( order of y )) mod 7
Figure 6.5 Hash table, H 2 , for candidate 2-itemsets. This hash table was generated by scanning
Table 6.1's transactions while determining L 1 . If the minimum support count is, say, 3, then
the itemsets in buckets 0, 1, 3, and 4 cannot be frequent and so they should not be included
in C 2 .
Hash-based technique (hashing itemsets into corresponding buckets): A hash-based
technique can be used to reduce the size of the candidate k -itemsets, C k , for k
1.
For example, when scanning each transaction in the database to generate the frequent
1-itemsets, L 1 , we can generate all the 2-itemsets for each transaction, hash (i.e., map)
them into the different buckets of a hash table structure, and increase the correspond-
ing bucket counts (Figure 6.5). A 2-itemset with a corresponding bucket count in the
hash table that is below the support threshold cannot be frequent and thus should
be removed from the candidate set. Such a hash-based technique may substantially
reduce the number of candidate k -itemsets examined (especially when k D 2).
>
Transaction reduction (reducing the number of transactions scanned in future itera-
tions): A transaction that does not contain any frequent k -itemsets cannot contain any
frequent ( k C1)-itemsets. Therefore, such a transaction can be marked or removed
from further consideration because subsequent database scans for j -itemsets, where
j
>
k , will not need to consider such a transaction.
Partitioning (partitioning the data to find candidate itemsets): A partitioning tech-
nique can be used that requires just two database scans to mine the frequent itemsets
(Figure 6.6). It consists of two phases. In phase I, the algorithm divides the trans-
actions of D into n nonoverlapping partitions. If the minimum relative support
threshold for transactions in D is min sup , then the minimum support count for a
partition is min sup the number of transactions in that partition . For each partition,
all the local frequent itemsets (i.e., the itemsets frequent within the partition) are found.
A local frequent itemset may or may not be frequent with respect to the entire
database, D . However, any itemset that is potentially frequent with respect to D must
occur as a frequent itemset in at least one of the partitions . 8 Therefore, all local frequent
itemsets are candidate itemsets with respect to D . The collection of frequent itemsets
from all partitions forms the global candidate itemsets with respect to D . In phase II,
8 The proof of this property is left as an exercise (see Exercise 6.3d).
 
Search WWH ::




Custom Search