Parallel FPM algorithms designed for message-passing systems typically partition the data such that each process holds an equal share of the input in local memory. As a consequence, subsets of the local transactions must be shared with other processes that need them. For example, in candidate generation approaches, one may choose to broadcast the local input among all processes in a round-robin fashion during each iteration of the algorithm. The amount of communication necessary in this scenario may be a detriment to overall execution. Moreover, the set of candidate itemsets must also be partitioned across processes, incurring additional communication overhead. An alternative approach may use a pattern-growth-based algorithm that mines the distinct projected databases, each associated with an equivalence class. Once a projected database has been extracted by communicating with the other processes that store its transactions, it can be mined by a process without further communication overhead. However, care must be taken to choose small enough equivalence classes such that their projected databases and count data structures fit in the local process memory.
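For concreteness, the round-robin exchange can be sketched in Python with mpi4py (an assumed tool choice; the toy transactions, the fixed candidate set, and all names below are illustrative, not drawn from any particular FPM implementation):

```python
from itertools import combinations
from mpi4py import MPI  # a sketch; run with e.g.: mpirun -n 2 python rr_count.py

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# Toy data: each process holds an equal share of the transactions.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]
local = transactions[rank::nprocs]

# The candidate itemsets are likewise partitioned across processes
# (here, hard-coded 2-itemsets; normally generated from the previous pass).
candidates = [frozenset(p) for p in combinations("abc", 2)]
mine = candidates[rank::nprocs]
counts = {c: 0 for c in mine}

# Round robin: in each step one process broadcasts its local partition,
# and every process counts its own candidates against the received batch.
for r in range(nprocs):
    batch = comm.bcast(local if rank == r else None, root=r)
    for t in batch:
        for c in mine:
            if c <= t:  # candidate itemset contained in transaction
                counts[c] += 1

print(f"rank {rank}: {counts}")  # global support counts for this rank's candidates
```

Each broadcast ships one process's entire partition to all others, so communication volume grows with both the number of processes and the input size, which is precisely the overhead noted above.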
3.3.2 MapReduce
MapReduce [13] is a recent programming model for distributed memory systems that has become very popular for data-intensive computing. By using a restricted programming model, MapReduce offers a simple method of writing parallel programs. While originally a proprietary software package at Google, several successful open-source MapReduce implementations have been developed, of which Hadoop [59] is currently the most popular.
Computation in MapReduce consists of supplying two routines, MAP and REDUCE. The problem input is specified as a set of key-value pairs. Each key is processed by an invocation of MAP, which emits another (possibly different) key-value pair. Emitted key-value pairs are grouped by key by the MapReduce framework, and then the list of grouped values in each group is processed individually by a REDUCE invocation. REDUCE in turn emits either the final program output or new key-value pairs that can be processed with MAP for another iteration.
The canonical example of a MapReduce program is word frequency counting in a set of documents. Program input is organized into key-value pairs of the form <ID, text>. A MAP process is assigned to each ID. MAP iterates over each word w_i in its assigned document and emits the pair <w_i, 1>. Pairs from all MAP processes are grouped and passed to a single REDUCE process. Finally, REDUCE counts the appearances of each w_i and outputs the final counts.
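A single-process Python simulation makes this data flow explicit (a sketch of the model only, not an actual Hadoop job; the MAP and REDUCE function names mirror the routines described above):

```python
from itertools import groupby

def MAP(doc_id, text):
    # Emit <w_i, 1> for every word w_i in the assigned document.
    for w in text.split():
        yield (w, 1)

def REDUCE(word, ones):
    # Sum the occurrences grouped under one key.
    yield (word, sum(ones))

docs = {1: "the quick brown fox", 2: "the lazy dog the fox"}

# Map phase: one MAP invocation per input key-value pair.
emitted = [kv for doc_id, text in docs.items() for kv in MAP(doc_id, text)]

# Shuffle: the framework groups emitted pairs by key.
emitted.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(emitted, key=lambda kv: kv[0])}

# Reduce phase: one REDUCE invocation per group.
output = dict(kv for k, vs in grouped.items() for kv in REDUCE(k, vs))
print(output)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```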
Individual MAP and REDUCE tasks are independent and can be executed in parallel. Unlike the message passing paradigm, parallelism in MapReduce is implicit, and the program developer is not required to consider low-level details such as data placement or communication across memory address spaces. The MapReduce framework manages all of the necessary communication details and typically implements key-value pair transmissions via a networked file system such as GFS [15] or HDFS [7].
Certain MapReduce implementations provide some means to make global read-only data easily accessible by all MAP processes. In Hadoop, this is achieved through the distributed cache.
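As a hedged sketch of that mechanism (assuming the Hadoop Streaming interface; the file stopwords.txt and the script name are hypothetical), a Python mapper can read a file shipped to every task through the distributed cache:

```python
#!/usr/bin/env python3
# wordcount_mapper.py -- a Hadoop Streaming mapper (illustrative only).
# Submitted with e.g.:
#   hadoop jar hadoop-streaming.jar -files stopwords.txt \
#       -mapper wordcount_mapper.py -reducer aggregate -input in -output out
# The -files option places stopwords.txt in the distributed cache, and
# Hadoop links it into each task's working directory.
import sys

with open("stopwords.txt") as f:  # global read-only data, visible to every MAP task
    stopwords = {line.strip() for line in f}

for line in sys.stdin:
    for w in line.split():
        if w not in stopwords:
            print(f"{w}\t1")  # emit <w, 1> as a tab-separated pair
```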