Parallel FPM algorithms designed for message-passing systems typically partition the data such that each process holds an equal share of the input in local memory. As a consequence, subsets of the local transactions must be shared with other processes that need them. For example, in candidate generation approaches, one may choose to broadcast the local input among all processes in a round-robin fashion during each iteration of the algorithm. The amount of communication necessary in this scenario may be a detriment to overall execution. Moreover, the set of candidate itemsets must also be partitioned across processes, incurring additional communication overhead. An alternative approach may use a pattern-growth-based algorithm that mines the distinct projected databases, each associated with an equivalence class. Once a projected database has been extracted by communicating with the other processes that store its transactions, it can be mined by a process without further communication overhead. However, care must be taken to choose small enough equivalence classes such that their projected databases and count data structures fit in the local process memory.
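For concreteness, the round-robin exchange can be sketched in Python with mpi4py (an assumed tool choice; the toy transactions, the fixed candidate set, and all names below are illustrative, not drawn from any particular FPM implementation):

```python
from itertools import combinations
from mpi4py import MPI  # a sketch; run with e.g.: mpirun -n 2 python rr_count.py

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# Toy data: each process holds an equal share of the transactions.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]
local = transactions[rank::nprocs]

# The candidate itemsets are likewise partitioned across processes
# (here, hard-coded 2-itemsets; normally generated from the previous pass).
candidates = [frozenset(p) for p in combinations("abc", 2)]
mine = candidates[rank::nprocs]
counts = {c: 0 for c in mine}

# Round robin: in each step one process broadcasts its local partition,
# and every process counts its own candidates against the received batch.
for r in range(nprocs):
    batch = comm.bcast(local if rank == r else None, root=r)
    for t in batch:
        for c in mine:
            if c <= t:  # candidate itemset contained in transaction
                counts[c] += 1

print(f"rank {rank}: {counts}")  # global support counts for this rank's candidates
```

Each broadcast ships one process's entire partition to all others, so communication volume grows with both the number of processes and the input size, which is precisely the overhead noted above.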
3.3.2 MapReduce
MapReduce [13] is a recent programming model for distributed memory systems that has become very popular for data-intensive computing. By using a restricted programming model, MapReduce offers a simple method of writing parallel programs. While originally a proprietary software package at Google, several successful open-source MapReduce implementations have been developed, of which Hadoop [59] is currently the most popular.
Computation in MapReduce consists of supplying two routines, MAP and REDUCE. The problem input is specified as a set of key-value pairs. Each key is processed by an invocation of MAP, which emits another (possibly different) key-value pair. Emitted key-value pairs are grouped by key by the MapReduce framework, and then the list of grouped values in each group is processed individually by a REDUCE invocation. REDUCE in turn emits either the final program output or new key-value pairs that can be processed with MAP for another iteration.
The canonical example of a MapReduce program is word frequency counting in a set of documents. Program input is organized into key-value pairs of the form <ID, text>. A MAP process is assigned to each ID. MAP iterates over each word w_i in its assigned document and emits the pair <w_i, 1>. Pairs from all MAP processes are grouped and passed to a single REDUCE process. Finally, REDUCE counts the appearances of each w_i and outputs the final counts.
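A single-process Python simulation makes this data flow explicit (a sketch of the model only, not an actual Hadoop job; the MAP and REDUCE function names mirror the routines described above):

```python
from itertools import groupby

def MAP(doc_id, text):
    # Emit <w_i, 1> for every word w_i in the assigned document.
    for w in text.split():
        yield (w, 1)

def REDUCE(word, ones):
    # Sum the occurrences grouped under one key.
    yield (word, sum(ones))

docs = {1: "the quick brown fox", 2: "the lazy dog the fox"}

# Map phase: one MAP invocation per input key-value pair.
emitted = [kv for doc_id, text in docs.items() for kv in MAP(doc_id, text)]

# Shuffle: the framework groups emitted pairs by key.
emitted.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(emitted, key=lambda kv: kv[0])}

# Reduce phase: one REDUCE invocation per group.
output = dict(kv for k, vs in grouped.items() for kv in REDUCE(k, vs))
print(output)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```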
Individual MAP and REDUCE tasks are independent and can be executed in parallel. Unlike the message passing paradigm, parallelism in MapReduce is implicit, and the program developer is not required to consider low-level details such as data placement or communication across memory address spaces. The MapReduce framework manages all of the necessary communication details and typically implements key-value pair transmissions via a networked file system such as GFS [15] or HDFS [7].
Certain MapReduce implementations provide some means to make global read-only data easily accessible by all MAP processes. In Hadoop, this is achieved through the distributed cache.
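As a hedged sketch of that mechanism (assuming the Hadoop Streaming interface; the file stopwords.txt and the script name are hypothetical), a Python mapper can read a file shipped to every task through the distributed cache:

```python
#!/usr/bin/env python3
# wordcount_mapper.py -- a Hadoop Streaming mapper (illustrative only).
# Submitted with e.g.:
#   hadoop jar hadoop-streaming.jar -files stopwords.txt \
#       -mapper wordcount_mapper.py -reducer aggregate -input in -output out
# The -files option places stopwords.txt in the distributed cache, and
# Hadoop links it into each task's working directory.
import sys

with open("stopwords.txt") as f:  # global read-only data, visible to every MAP task
    stopwords = {line.strip() for line in f}

for line in sys.stdin:
    for w in line.split():
        if w not in stopwords:
            print(f"{w}\t1")  # emit <w, 1> as a tab-separated pair
```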