to eliminate those itemsets that were identified as frequent in the sample but are not frequent in the whole, we have no false positives and hopefully none or very few false negatives.
6.4.3 The Algorithm of Savasere, Omiecinski, and Navathe
Our next improvement avoids both false negatives and false positives, at the cost of making two full passes. It is called the SON Algorithm after the authors. The idea is to divide the input file into chunks (which may be "chunks" in the sense of a distributed file system, or simply pieces of the file). Treat each chunk as a sample, and run the algorithm of Section 6.4.1 on that chunk. If each chunk is fraction p of the whole file and s is the support threshold, then we use ps as the threshold for each chunk. Store on disk all the frequent itemsets found for each chunk.
Once all the chunks have been processed in that way, take the union of all the itemsets
that have been found frequent for one or more chunks. These are the candidate itemsets.
Notice that if an itemset is not frequent in any chunk, then its support is less than ps in each chunk. Since the number of chunks is 1/p, we conclude that the total support for that itemset is less than (1/p)ps = s. Thus, every itemset that is frequent in the whole is frequent
in at least one chunk, and we can be sure that all the truly frequent itemsets are among the
candidates; i.e., there are no false negatives.
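For a concrete (hypothetical) instance, suppose s = 10,000 and the file is divided into ten chunks, so p = 1/10 and the per-chunk threshold is ps = 1,000. An itemset that is frequent in no chunk appears fewer than 1,000 times in each of the ten chunks, hence fewer than 10,000 times in the whole file, so it cannot be frequent in the whole.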
We have made a total of one pass through the data as we read each chunk and processed
it. In a second pass, we count all the candidate itemsets and select those that have support
at least s as the frequent itemsets.
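To make the two passes concrete, here is a minimal in-memory sketch in Python. It is not the book's code: the helper frequent_itemsets exhaustively enumerates every itemset occurring in a chunk instead of running the A-Priori-based algorithm of Section 6.4.1, which is workable only when baskets are small, and the names son and frequent_itemsets are ours.

from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, threshold):
    # Stand-in for Section 6.4.1: exhaustively count every itemset
    # that appears in some basket.  Adequate only for small baskets;
    # a realistic version would use A-Priori.
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for k in range(1, len(items) + 1):
            counts.update(combinations(items, k))
    return {itemset for itemset, c in counts.items() if c >= threshold}

def son(baskets, s, num_chunks):
    chunk_size = max(1, -(-len(baskets) // num_chunks))  # ceiling division
    p = 1.0 / num_chunks  # each chunk is fraction p of the file

    # Pass 1: union of the itemsets frequent in some chunk, each chunk
    # mined at the lowered threshold ps.  These are the candidates.
    candidates = set()
    for i in range(0, len(baskets), chunk_size):
        candidates |= frequent_itemsets(baskets[i:i + chunk_size], p * s)

    # Pass 2: count each candidate over the whole file and keep those
    # with support at least s; no false negatives are possible.
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        counts.update(c for c in candidates if items >= set(c))
    return {c for c in candidates if counts[c] >= s}

For example, son(baskets, s=100, num_chunks=10) mines each chunk at threshold ps = 10 and then, on the second pass, verifies the surviving candidates against the true threshold of 100.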
6.4.4 The SON Algorithm and MapReduce
The SON algorithm lends itself well to a parallel-computing environment. Each of the chunks can be processed in parallel, and the frequent itemsets from each chunk combined to form the candidates. We can distribute the candidates to many processors, have each processor count the support for each candidate in a subset of the baskets, and finally sum those supports to get the support for each candidate itemset in the whole dataset. This process does not have to be implemented in MapReduce, but there is a natural way of expressing each of the two passes as a MapReduce operation. We shall summarize this MapReduce-MapReduce sequence below.
First Map Function: Take the assigned subset of the baskets and find the itemsets frequent in the subset using the algorithm of Section 6.4.1. As described there, lower the support threshold from s to ps if each Map task gets fraction p of the total input file. The output is a set of key-value pairs (F, 1), where F is a frequent itemset from the sample.
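As a sketch of this step (again in plain Python rather than a real MapReduce framework, and reusing the illustrative frequent_itemsets helper from above), the first Map function might look like:

def first_map(chunk, p, s):
    # Emit (F, 1) for every itemset F frequent in this chunk at the
    # lowered threshold ps; the value 1 is merely a placeholder.
    for F in frequent_itemsets(chunk, p * s):
        yield (F, 1)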