9.2 MapReduce Framework: Basic Architecture
The MapReduce framework is introduced as a simple and powerful programming
model that enables easy development of scalable parallel applications to process
vast amounts of data on large clusters of commodity machines [118, 119]. In
particular, the implementation described in the original paper is mainly designed
to achieve high performance on large clusters of commodity PCs. One of the
main advantages of this approach is that it isolates the application from the details
of running a distributed program, such as issues of data distribution, scheduling,
and fault tolerance. In this model, the computation takes a set of key/value pairs
as input and produces a set of key/value pairs as output. The user of the MapReduce
framework expresses the computation using two functions: Map and Reduce. The
Map function takes an input pair and produces a set of intermediate key/value
pairs. The MapReduce framework groups together all intermediate values associated
with the same intermediate key I and passes them to the Reduce function. The
Reduce function receives an intermediate key I with its set of values and merges
them together. Typically just zero or one output value is produced per Reduce
invocation. The main advantage of this model is that it allows large computations
to be easily parallelized and re-executed, with re-execution serving as the primary
mechanism for fault tolerance. Figure 9.1 illustrates an example MapReduce program,
expressed in pseudo-code, for counting the number of occurrences of each word in a
collection of documents. In this example, the map function emits each word together
with an associated count of occurrences, while the reduce function sums all counts
emitted for a particular word. In principle, the design of the MapReduce framework
has considered the following main principles [103]:
• Low-cost unreliable commodity hardware: Instead of using expensive, high-
performance, reliable symmetric multiprocessing (SMP) or massively parallel
processing (MPP) machines equipped with high-end network and storage
subsystems, the MapReduce framework is designed to run on large clusters of
commodity hardware. This hardware is managed and powered by open-source
operating systems and utilities so that the cost is low.
• Extremely scalable RAIN cluster: Instead of using centralized RAID-based SAN
or NAS storage systems, every MapReduce node has its own local off-the-shelf
hard drives. These nodes are loosely coupled where they are placed in
Fig. 9.1 An example MapReduce program
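The pseudo-code of Fig. 9.1 is not reproduced in this extract. As a rough illustration of the same word-count logic described above, the following self-contained Python sketch pairs the two user-supplied functions with a toy sequential driver that stands in for the distributed framework; the names map_fn, reduce_fn, and run_mapreduce are illustrative only and are not part of any real MapReduce API.

from collections import defaultdict

# Map takes (k1, v1) and yields intermediate (k2, v2) pairs; Reduce takes
# (k2, list of v2) and yields zero or one output value, as described above.

def map_fn(doc_name, doc_contents):
    # Map: for every word occurrence in the document, emit the pair (word, 1).
    for word in doc_contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for a particular word.
    yield sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy sequential driver standing in for the framework: apply the map
    # function, group intermediate values by key (the shuffle step), then
    # apply the reduce function to each group of values.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    return {ikey: list(reduce_fn(ikey, ivalues))
            for ikey, ivalues in intermediate.items()}

if __name__ == "__main__":
    docs = [("doc1", "the quick brown fox"),
            ("doc2", "the lazy dog and the fox")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # -> {'the': [3], 'quick': [1], 'brown': [1], 'fox': [2], ...}

In an actual MapReduce deployment, the grouping of intermediate values and the parallel execution of map and reduce tasks over the cluster are performed by the framework itself rather than by a single in-memory loop.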