racks that can be connected with standard networking hardware. These nodes can be taken out of service with almost no impact on still-running MapReduce jobs. Such clusters are called Redundant Array of Independent (and Inexpensive) Nodes (RAIN).
• Fault-tolerant yet easy to administer: MapReduce jobs can run on clusters with thousands of nodes or even more. These nodes are not very reliable: at any point in time, a certain percentage of the commodity nodes or hard drives will be out of order. The MapReduce framework therefore applies straightforward mechanisms to replicate data and launch backup tasks so as to keep still-running processes going. To handle crashed nodes, system administrators simply take the crashed hardware offline. New nodes can be plugged in at any time without much administrative hassle, and there are no complicated backup, restore, and recovery configurations like those found in many DBMSs. (A configuration sketch of these fault-tolerance settings appears after this list.)
• Highly parallel yet abstracted: The most important contribution of the MapReduce framework is its ability to automatically parallelize task execution. It thus allows developers to focus on the problem at hand rather than on low-level implementation details such as memory management, file allocation, and parallel, multi-threaded, or network programming. Moreover, MapReduce's shared-nothing architecture [215] makes it much more scalable and ready for parallelization. (A minimal word-count sketch of this programming model appears after this list.)
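The following sketch illustrates the fault-tolerance settings mentioned above: HDFS block replication and speculative (backup) task execution. The property names are those of Hadoop 2.x and later; the class name ClusterDefaults and the values shown are purely illustrative assumptions, not cluster recommendations.

    import org.apache.hadoop.conf.Configuration;

    public class ClusterDefaults {
        public static Configuration make() {
            Configuration conf = new Configuration();
            // Keep three copies of every HDFS block so data on a crashed
            // node can be re-replicated from the surviving copies.
            conf.setInt("dfs.replication", 3);
            // Launch backup attempts for straggling map and reduce tasks.
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", true);
            return conf;
        }
    }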
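To see how much is abstracted away, consider the canonical word-count program, sketched below following the standard Hadoop tutorial example. The developer writes only the Map and Reduce functions; input splitting, scheduling, shuffling, and failure recovery are handled by the framework.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }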
Hadoop [9] is an open source Java library [230] that supports data-intensive distributed applications by implementing the MapReduce framework.¹ It has been widely used by a large number of companies for production purposes.² On the implementation level, the Map invocations of a MapReduce job are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can then be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user; a sketch of such a partitioner appears after the numbered list below. Figure 9.2 illustrates an example of the overall flow of a MapReduce operation, which goes through the following sequence of actions:
1. The MapReduce library splits the input data into M pieces and starts up many instances of the program on a cluster of machines.
2. One of the instances of the program is elected to be the master copy, while the rest are workers that are assigned their work by the master copy. In particular, there are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each of them one or more map tasks and/or reduce tasks.
3. A worker that is assigned a map task reads the contents of the corresponding input split, parses key/value pairs out of the input data, and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
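As noted above, the partitioning of the intermediate key space can be expressed as a small user-supplied class. The following is a minimal sketch equivalent to the default hash(key) mod R behavior; it mirrors Hadoop's built-in HashPartitioner, and the class name ModPartitioner and the key/value types are only illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each intermediate key to one of the R reduce partitions.
    public class ModPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask off the sign bit so the index is non-negative,
            // then take the hash modulo R (numPartitions).
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

A job would select R and this partitioner with job.setNumReduceTasks(R) and job.setPartitionerClass(ModPartitioner.class).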
¹ In the rest of this chapter, we use the two names, MapReduce and Hadoop, interchangeably.
² http://wiki.apache.org/hadoop/PoweredBy