The output of the first Map function is a set of key-value pairs (F, 1), where F is a frequent itemset from the sample. The value is always 1 and is irrelevant.
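For concreteness, here is a minimal Python sketch of this first Map function. The helper apriori(baskets, threshold), which finds all itemsets frequent in a main-memory collection of baskets, is an assumption of the sketch, as is the proportionally lowered threshold p*s for a chunk that holds fraction p of the whole file.

    def first_map(chunk, s, p):
        # chunk: the baskets assigned to this Map task, each a frozenset
        # s: support threshold for the whole file
        # p: fraction of the file in this chunk (assumption: the chunk is
        #    treated as a sample, so the local threshold is p*s)
        for F in apriori(chunk, p * s):   # hypothetical in-memory finder
            yield (F, 1)                  # the value 1 is never used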
First Reduce Function: Each Reduce task is assigned a set of keys, which are itemsets. The value is ignored, and the Reduce task simply produces those keys (itemsets) that appear one or more times. Thus, the output of the first Reduce function is the candidate itemsets.
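A corresponding sketch of the first Reduce function, in the same hypothetical setting:

    def first_reduce(itemset, values):
        # values: the stream of 1s emitted for this key; it is ignored.
        # Emitting the key exactly once makes the itemset a candidate.
        yield itemset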
Second Map Function: The Map tasks for the second Map function take all the output from the first Reduce function (the candidate itemsets) and a portion of the input data file. Each Map task counts the number of occurrences of each of the candidate itemsets among the baskets in the portion of the dataset that it was assigned. The output is a set of key-value pairs (C, v), where C is one of the candidate itemsets and v is the support for that itemset among the baskets that were input to this Map task.
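The counting in the second Map function might look as follows; representing baskets and candidates as frozensets is an assumption made for the sketch.

    def second_map(chunk, candidates):
        # chunk: the baskets assigned to this Map task
        # candidates: every itemset output by the first Reduce function
        counts = dict.fromkeys(candidates, 0)
        for basket in chunk:
            for C in candidates:
                if C <= basket:        # C is a subset of the basket
                    counts[C] += 1
        for C, v in counts.items():
            if v > 0:
                yield (C, v)           # v: support of C within this chunk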
Second Reduce Function: The Reduce tasks take the itemsets they are given as keys and sum the associated values. The result is the total support for each of the itemsets that the Reduce task was assigned to handle. Those itemsets whose sum of values is at least s are frequent in the whole dataset, so the Reduce task outputs these itemsets with their counts. Itemsets that do not have total support at least s are not transmitted to the output of the Reduce task.
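A sketch of the second Reduce function, with s again the support threshold for the whole dataset:

    def second_reduce(itemset, partial_counts, s):
        total = sum(partial_counts)    # total support across all chunks
        if total >= s:                 # frequent in the whole dataset
            yield (itemset, total)
        # itemsets with total support below s produce no output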
6.4.5 Toivonen's Algorithm
This algorithm uses randomness in a different way from the simple sampling algorithm of Section 6.4.1. Toivonen's Algorithm, given sufficient main memory, will use one pass over a small sample and one full pass over the data. It will give neither false negatives nor false positives, but there is a small yet nonzero probability that it will fail to produce any answer at all. In that case, it needs to be repeated until it gives an answer. However, the average number of passes needed before it produces all and only the frequent itemsets is a small constant.
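This repeat-until-success behavior can be pictured as a simple driver loop; toivonen_pass is a hypothetical function carrying out the one-sample, one-full-pass procedure described next, returning None on the low-probability failure.

    def toivonen(baskets, s, p):
        while True:
            result = toivonen_pass(baskets, s, p)  # hypothetical one round
            if result is not None:
                return result          # all and only the frequent itemsets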
Toivonen's algorithm begins by selecting a small sample of the input dataset, and finding from it the candidate frequent itemsets. The process is exactly that of Section 6.4.1, except that it is essential that the threshold be set to something less than its proportional value. That is, if the support threshold for the whole dataset is s, and the sample size is fraction p, then when looking for frequent itemsets in the sample, use a threshold such as 0.9ps or 0.8ps. The smaller we make the threshold, the more main memory we need for computing all itemsets that are frequent in the sample, but the more likely we are to avoid the situation where the algorithm fails to provide an answer.
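This sampling step with the lowered threshold might be sketched as follows, reusing the assumed apriori helper; the factor 0.9 is the example value from the text.

    import random

    def sample_candidates(baskets, s, p, shrink=0.9):
        # Draw a sample of roughly fraction p of the baskets, then find
        # its frequent itemsets at shrink*p*s, i.e. below the proportional
        # threshold p*s, to reduce the chance that the algorithm fails.
        sample = [b for b in baskets if random.random() < p]
        return apriori(sample, shrink * p * s)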