use of a Distributed Cache, which is an efficient file caching system built into the
Hadoop runtime.
A criticism of the model comes from its heavy reliance on disk access between the
MAP and REDUCE stages when key-value pairs must be grouped and partitioned across
REDUCE processes. If the computation is iterative in nature, a naïve MapReduce
implementation could require shuffling the data between physical disk drives in each
iteration.
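To make the cost of this shuffle step concrete, the following sketch groups mapper output by key and assigns each key to a REDUCE process by hash partitioning; in an actual Hadoop job, each such partition is spilled to disk between the MAP and REDUCE stages, which is the overhead an iterative computation pays repeatedly. The function and variable names are purely illustrative and do not correspond to any Hadoop API.

```python
from collections import defaultdict

def partition(mapped_pairs, num_reducers):
    """Group (key, value) pairs emitted by mappers and assign each key to a
    reducer by hashing. In Hadoop, every partition is written to disk before
    the corresponding REDUCE task reads it back."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        # Note: Python string hashing is randomized per run, so the exact
        # assignment of keys to reducers may differ between executions.
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

# Example: word-count style mapper output shuffled to 2 reducers.
mapped = [("apple", 1), ("bread", 1), ("apple", 1), ("milk", 1)]
for i, part in enumerate(partition(mapped, 2)):
    print(f"reducer {i}: {dict(part)}")
```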
Pattern growth methods are popular in the MapReduce environment due to their
ability to decompose the problem into portions that can be solved independently.
After computing the support of 1-length patterns in a manner similar to the word-
frequency counting example described previously, equivalence classes can be mapped
to different processes via MAP. A serial algorithm is used to complete each local task,
and a REDUCE job then gathers the output from all processes.
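The sketch below mimics this two-phase scheme in plain Python: a word-count style pass computes 1-item supports, each frequent item then defines an equivalence class that is mined independently by a local serial routine, and the per-class results are gathered at the end. The helper names and the trivial serial_mine stand-in (which only finds frequent 2-itemsets) are assumptions for illustration, not a specific published algorithm.

```python
from collections import Counter

def count_singletons(transactions, minsup):
    """Phase 1 (MAP/REDUCE): count 1-length patterns, like word counting."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= minsup}

def serial_mine(prefix_item, projected, minsup):
    """Placeholder for the serial pattern-growth algorithm run locally; here it
    only reports frequent 2-itemsets that extend the given prefix item."""
    counts = Counter(i for t in projected for i in set(t) if i > prefix_item)
    return [((prefix_item, i), c) for i, c in counts.items() if c >= minsup]

def mine(transactions, minsup):
    frequent = count_singletons(transactions, minsup)
    results = []
    # MAP: each frequent item defines an equivalence class handled independently.
    for item in sorted(frequent):
        projected = [t for t in transactions if item in t]  # local projected data
        results.extend(serial_mine(item, projected, minsup))  # local serial task
    return results  # REDUCE: gather the output of all equivalence classes

print(mine([{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}], minsup=2))
```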
4 Frequent Itemset Mining
As with serial FIM, there are two main approaches to solving the parallel
frequent itemset mining (PFIM) problem, namely, candidate generation and pattern
growth. In this section, we discuss a number of proposed PFIM algorithms, paying
special attention to the algorithmic details addressing the three challenges introduced
in Sect. 3. For easy reference, Table 10.2 lists the serial and parallel methods described
in this section.
4.1 Memory Scalability
Serial FIM algorithms can be easily parallelized if memory constraints are ignored.
Among the first to address the memory constraint issues related to PFIM were
Agrawal and Shafer [3], who proposed the Count Distribution (CD) algorithm. In
their method, the list of transactions
is distributed among the processes such that each
process is responsible for computing the support of all candidate itemsets with re-
spect to its local transactions. Instead of a globally accessible hash tree, each process
builds a local hash tree which includes all candidate itemsets. Then, with a single
pass over the local transactions, the local support for each candidate itemset can be
computed. The global support for each candidate itemset can then be computed as
the sum of each process' local support for the given candidate itemset, using a global
reduction operation. At the same time that Agrawal and Shafer introduced the CD
algorithm, Shintani and Kitsuregawa [52] introduced an identical algorithm, which
they called Non-Partitioned Apriori (NPA).
Since each process can build its hash tree and compute its local support for candi-
date itemsets independently, the only inter-process communication required is during
the global reduction of the local support counts.
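A minimal sketch of the Count Distribution idea, assuming the candidate itemsets of the current Apriori pass are already known: each "process" counts local support over its own transaction partition (a plain dictionary stands in for the local hash tree), and a single global sum-reduction yields the global supports. The partitioning, data, and helper names are illustrative assumptions only.

```python
from collections import Counter
from functools import reduce

def local_support(transactions, candidates):
    """Each process scans only its local transactions and counts, for every
    candidate itemset, how many of those transactions contain it."""
    counts = Counter()
    for t in transactions:
        for cand in candidates:
            if cand <= t:          # candidate itemset contained in transaction
                counts[cand] += 1
    return counts

# Candidate 2-itemsets for the current pass (assumed already generated).
candidates = [frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c")]]

# Transactions partitioned across two processes.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"b", "c"}, {"a", "c"}],
]

# CD: independent local counting, then one global sum-reduction of the counts.
local_counts = [local_support(part, candidates) for part in partitions]
global_support = reduce(lambda x, y: x + y, local_counts)
print(dict(global_support))
```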