The output of the first Map function is a set of key-value pairs (F, 1), where F is a frequent itemset from the sample. The value is always 1 and is irrelevant.
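For concreteness, here is a minimal Python sketch of this first Map function. The helper apriori(baskets, threshold), which finds all itemsets frequent in a main-memory collection of baskets, is an assumption of the sketch, as is the proportionally lowered threshold p*s for a chunk that holds fraction p of the whole file.

    def first_map(chunk, s, p):
        # chunk: the baskets assigned to this Map task, each a frozenset
        # s: support threshold for the whole file
        # p: fraction of the file in this chunk (assumption: the chunk is
        #    treated as a sample, so the local threshold is p*s)
        for F in apriori(chunk, p * s):   # hypothetical in-memory finder
            yield (F, 1)                  # the value 1 is never used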
First Reduce Function: Each Reduce task is assigned a set of keys, which are itemsets. The value is ignored, and the Reduce task simply produces those keys (itemsets) that appear one or more times. Thus, the output of the first Reduce function is the candidate itemsets.
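A corresponding sketch of the first Reduce function, in the same hypothetical setting:

    def first_reduce(itemset, values):
        # values: the stream of 1s emitted for this key; it is ignored.
        # Emitting the key exactly once makes the itemset a candidate.
        yield itemset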
Second Map Function: The Map tasks for the second Map function take all the output from the first Reduce function (the candidate itemsets) and a portion of the input data file. Each Map task counts the number of occurrences of each of the candidate itemsets among the baskets in the portion of the dataset that it was assigned. The output is a set of key-value pairs (C, v), where C is one of the candidate itemsets and v is the support for that itemset among the baskets that were input to this Map task.
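The counting in the second Map function might look as follows; representing baskets and candidates as frozensets is an assumption made for the sketch.

    def second_map(chunk, candidates):
        # chunk: the baskets assigned to this Map task
        # candidates: every itemset output by the first Reduce function
        counts = dict.fromkeys(candidates, 0)
        for basket in chunk:
            for C in candidates:
                if C <= basket:        # C is a subset of the basket
                    counts[C] += 1
        for C, v in counts.items():
            if v > 0:
                yield (C, v)           # v: support of C within this chunk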
Second Reduce Function: The Reduce tasks take the itemsets they are given as keys and sum the associated values. The result is the total support for each of the itemsets that the Reduce task was assigned to handle. Those itemsets whose sum of values is at least s are frequent in the whole dataset, so the Reduce task outputs these itemsets with their counts. Itemsets that do not have total support at least s are not transmitted to the output of the Reduce task.
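A sketch of the second Reduce function, with s again the support threshold for the whole dataset:

    def second_reduce(itemset, partial_counts, s):
        total = sum(partial_counts)    # total support across all chunks
        if total >= s:                 # frequent in the whole dataset
            yield (itemset, total)
        # itemsets with total support below s produce no output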
6.4.5 Toivonen's Algorithm
This algorithm uses randomness in a different way from the simple sampling algorithm of Section 6.4.1. Toivonen's Algorithm, given sufficient main memory, will use one pass over a small sample and one full pass over the data. It will give neither false negatives nor false positives, but there is a small yet nonzero probability that it will fail to produce any answer at all. In that case, it needs to be repeated until it gives an answer. However, the average number of passes needed before it produces all and only the frequent itemsets is a small constant.
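This repeat-until-success behavior can be pictured as a simple driver loop; toivonen_pass is a hypothetical function carrying out the one-sample, one-full-pass procedure described next, returning None on the low-probability failure.

    def toivonen(baskets, s, p):
        while True:
            result = toivonen_pass(baskets, s, p)  # hypothetical one round
            if result is not None:
                return result          # all and only the frequent itemsets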
Toivonen's algorithm begins by selecting a small sample of the input dataset, and finding from it the candidate frequent itemsets. The process is exactly that of Section 6.4.1, except that it is essential that the threshold be set to something less than its proportional value. That is, if the support threshold for the whole dataset is s, and the sample size is fraction p, then when looking for frequent itemsets in the sample, use a threshold such as 0.9ps or 0.8ps. The smaller we make the threshold, the more main memory we need for computing all itemsets that are frequent in the sample, but the more likely we are to avoid the situation where the algorithm fails to provide an answer.
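This sampling step with the lowered threshold might be sketched as follows, reusing the assumed apriori helper; the factor 0.9 is the example value from the text.

    import random

    def sample_candidates(baskets, s, p, shrink=0.9):
        # Draw a sample of roughly fraction p of the baskets, then find
        # its frequent itemsets at shrink*p*s, i.e. below the proportional
        # threshold p*s, to reduce the chance that the algorithm fails.
        sample = [b for b in baskets if random.random() < p]
        return apriori(sample, shrink * p * s)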