9.2 MapReduce Framework: Basic Architecture
The MapReduce framework is introduced as a simple and powerful programming
model that enables easy development of scalable parallel applications to process
vast amounts of data on large clusters of commodity machines [118, 119]. In
particular, the implementation described in the original paper is mainly designed
to achieve high performance on large clusters of commodity PCs. One of the
main advantages of this approach is that it isolates the application from the details
of running a distributed program, such as issues of data distribution, scheduling,
and fault tolerance. In this model, the computation takes a set of key/value pairs
as input and produces a set of key/value pairs as output. The user of the MapReduce
framework expresses the computation using two functions: Map and Reduce. The
Map function takes an input pair and produces a set of intermediate key/value
pairs. The MapReduce framework groups together all intermediate values associated
with the same intermediate key I and passes them to the Reduce function. The
Reduce function receives an intermediate key I with its set of values and merges
them together. Typically just zero or one output value is produced per Reduce
invocation. The main advantage of this model is that it allows large computations
to be easily parallelized and re-executed, with re-execution serving as the primary
mechanism for fault tolerance. Figure 9.1 illustrates an example MapReduce program,
expressed in pseudo-code, for counting the number of occurrences of each word in a
collection of documents. In this example, the map function emits each word together
with an associated count of occurrences, while the reduce function sums all counts
emitted for a particular word. In principle, the design of the MapReduce framework
has considered the following main principles [103]:
• Low-cost unreliable commodity hardware: Instead of using expensive, high-
performance, reliable symmetric multiprocessing (SMP) or massively parallel
processing (MPP) machines equipped with high-end network and storage
subsystems, the MapReduce framework is designed to run on large clusters of
commodity hardware. This hardware is managed and powered by open-source
operating systems and utilities so that the cost is low.
• Extremely scalable RAIN cluster: Instead of using centralized RAID-based SAN
or NAS storage systems, every MapReduce node has its own local off-the-shelf
hard drives. These nodes are loosely coupled where they are placed in
Fig. 9.1 An example MapReduce program
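The pseudo-code of Fig. 9.1 is not reproduced in this extract. As a rough illustration of the same word-count logic described above, the following self-contained Python sketch pairs the two user-supplied functions with a toy sequential driver that stands in for the distributed framework; the names map_fn, reduce_fn, and run_mapreduce are illustrative only and are not part of any real MapReduce API.

from collections import defaultdict

# Map takes (k1, v1) and yields intermediate (k2, v2) pairs; Reduce takes
# (k2, list of v2) and yields zero or one output value, as described above.

def map_fn(doc_name, doc_contents):
    # Map: for every word occurrence in the document, emit the pair (word, 1).
    for word in doc_contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for a particular word.
    yield sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy sequential driver standing in for the framework: apply the map
    # function, group intermediate values by key (the shuffle step), then
    # apply the reduce function to each group of values.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    return {ikey: list(reduce_fn(ikey, ivalues))
            for ikey, ivalues in intermediate.items()}

if __name__ == "__main__":
    docs = [("doc1", "the quick brown fox"),
            ("doc2", "the lazy dog and the fox")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # -> {'the': [3], 'quick': [1], 'brown': [1], 'fox': [2], ...}

In an actual MapReduce deployment, the grouping of intermediate values and the parallel execution of map and reduce tasks over the cluster are performed by the framework itself rather than by a single in-memory loop.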