use of a Distributed Cache, which is an efficient file caching system built into the
Hadoop runtime.
A criticism of the model comes from its heavy reliance on disk access between the
MAP and REDUCE stages when key-value pairs must be grouped and partitioned across
REDUCE processes. If the computation is iterative in nature, a naïve MapReduce
implementation could require shuffling the data between physical disk drives in each
iteration.
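To make the cost of this shuffle step concrete, the following sketch groups mapper output by key and assigns each key to a REDUCE process by hash partitioning; in an actual Hadoop job, each such partition is spilled to disk between the MAP and REDUCE stages, which is the overhead an iterative computation pays repeatedly. The function and variable names are purely illustrative and do not correspond to any Hadoop API.

```python
from collections import defaultdict

def partition(mapped_pairs, num_reducers):
    """Group (key, value) pairs emitted by mappers and assign each key to a
    reducer by hashing. In Hadoop, every partition is written to disk before
    the corresponding REDUCE task reads it back."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        # Note: Python string hashing is randomized per run, so the exact
        # assignment of keys to reducers may differ between executions.
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

# Example: word-count style mapper output shuffled to 2 reducers.
mapped = [("apple", 1), ("bread", 1), ("apple", 1), ("milk", 1)]
for i, part in enumerate(partition(mapped, 2)):
    print(f"reducer {i}: {dict(part)}")
```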
Pattern growth methods are popular in the MapReduce environment due to their
ability to decompose the problem into portions that can be solved independently.
After computing the support of 1-length patterns in a manner similar to the word-
frequency counting example described previously, equivalence classes can be mapped
to different processes via MAP. A serial algorithm is used to complete each local task,
and a REDUCE job then gathers the output from all processes.
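The sketch below mimics this two-phase scheme in plain Python: a word-count style pass computes 1-item supports, each frequent item then defines an equivalence class that is mined independently by a local serial routine, and the per-class results are gathered at the end. The helper names and the trivial serial_mine stand-in (which only finds frequent 2-itemsets) are assumptions for illustration, not a specific published algorithm.

```python
from collections import Counter

def count_singletons(transactions, minsup):
    """Phase 1 (MAP/REDUCE): count 1-length patterns, like word counting."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= minsup}

def serial_mine(prefix_item, projected, minsup):
    """Placeholder for the serial pattern-growth algorithm run locally; here it
    only reports frequent 2-itemsets that extend the given prefix item."""
    counts = Counter(i for t in projected for i in set(t) if i > prefix_item)
    return [((prefix_item, i), c) for i, c in counts.items() if c >= minsup]

def mine(transactions, minsup):
    frequent = count_singletons(transactions, minsup)
    results = []
    # MAP: each frequent item defines an equivalence class handled independently.
    for item in sorted(frequent):
        projected = [t for t in transactions if item in t]  # local projected data
        results.extend(serial_mine(item, projected, minsup))  # local serial task
    return results  # REDUCE: gather the output of all equivalence classes

print(mine([{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}], minsup=2))
```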
4 Frequent Itemset Mining
As with serial FIM, there are two main approaches to solving the parallel
frequent itemset mining (PFIM) problem, namely, candidate generation and pattern
growth. In this section, we discuss a number of proposed PFIM algorithms, paying
special attention to the algorithmic details addressing the three challenges introduced
in Sect. 3. For easy reference, Table 10.2 lists the serial and parallel methods described
in this section.
4.1 Memory Scalability
Serial FIM algorithms can be easily parallelized if memory constraints are ignored.
Among the first to address the memory constraint issues related to PFIM were
Agrawal and Shafer [3], who proposed the Count Distribution (CD) algorithm. In
their method, the list of transactions
is distributed among the processes such that each
process is responsible for computing the support of all candidate itemsets with re-
spect to its local transactions. Instead of a globally accessible hash tree, each process
builds a local hash tree which includes all candidate itemsets. Then, with a single
pass over the local transactions, the local support for each candidate itemset can be
computed. The global support for each candidate itemset can then be computed as
the sum of each process' local support for the given candidate itemset, using a global
reduction operation. At the same time that Agrawal and Shafer introduced the CD
algorithm, Shintani and Kitsuregawa [52] introduced an identical algorithm, which
they called Non-Partitioned Apriori (NPA).
Since each process can build its hash tree and compute its local support for candi-
date itemsets independently, the only inter-process communication required is during
the global reduction of the local support counts.
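A minimal sketch of the Count Distribution idea, assuming the candidate itemsets of the current Apriori pass are already known: each "process" counts local support over its own transaction partition (a plain dictionary stands in for the local hash tree), and a single global sum-reduction yields the global supports. The partitioning, data, and helper names are illustrative assumptions only.

```python
from collections import Counter
from functools import reduce

def local_support(transactions, candidates):
    """Each process scans only its local transactions and counts, for every
    candidate itemset, how many of those transactions contain it."""
    counts = Counter()
    for t in transactions:
        for cand in candidates:
            if cand <= t:          # candidate itemset contained in transaction
                counts[cand] += 1
    return counts

# Candidate 2-itemsets for the current pass (assumed already generated).
candidates = [frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c")]]

# Transactions partitioned across two processes.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"b", "c"}, {"a", "c"}],
]

# CD: independent local counting, then one global sum-reduction of the counts.
local_counts = [local_support(part, candidates) for part in partitions]
global_support = reduce(lambda x, y: x + y, local_counts)
print(dict(global_support))
```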