racks that can be connected with standard networking hardware. These nodes can be taken out of service with almost no impact on still-running MapReduce jobs. Such clusters are called Redundant Array of Independent (and Inexpensive) Nodes (RAIN).
• Fault-tolerant yet easy to administer: MapReduce jobs can run on clusters with thousands of nodes or even more. These nodes are not very reliable: at any point in time, a certain percentage of the commodity nodes or hard drives will be out of order. The MapReduce framework therefore applies straightforward mechanisms to replicate data and launch backup tasks so as to keep still-running processes going. To handle crashed nodes, system administrators simply take the crashed hardware offline. New nodes can be plugged in at any time without much administrative hassle, and there are no complicated backup, restore, and recovery configurations like those found in many DBMSs. (A configuration sketch of these fault-tolerance settings appears after this list.)
• Highly parallel yet abstracted: The most important contribution of the MapReduce framework is its ability to automatically parallelize task execution. It thus allows developers to focus on the problem at hand rather than on low-level implementation details such as memory management, file allocation, and parallel, multi-threaded, or network programming. Moreover, MapReduce's shared-nothing architecture [215] makes it much more scalable and ready for parallelization. (A minimal word-count sketch of this programming model appears after this list.)
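The following sketch illustrates the fault-tolerance settings mentioned above: HDFS block replication and speculative (backup) task execution. The property names are those of Hadoop 2.x and later; the class name ClusterDefaults and the values shown are purely illustrative assumptions, not cluster recommendations.

    import org.apache.hadoop.conf.Configuration;

    public class ClusterDefaults {
        public static Configuration make() {
            Configuration conf = new Configuration();
            // Keep three copies of every HDFS block so data on a crashed
            // node can be re-replicated from the surviving copies.
            conf.setInt("dfs.replication", 3);
            // Launch backup attempts for straggling map and reduce tasks.
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", true);
            return conf;
        }
    }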
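To see how much is abstracted away, consider the canonical word-count program, sketched below following the standard Hadoop tutorial example. The developer writes only the Map and Reduce functions; input splitting, scheduling, shuffling, and failure recovery are handled by the framework.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }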
Hadoop [9] is an open source Java library [230] that supports data-intensive distributed applications by implementing the MapReduce framework.¹ It has been widely used by a large number of companies for production purposes.² On the implementation level, the Map invocations of a MapReduce job are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can then be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user; a sketch of such a partitioner appears after the numbered list below. Figure 9.2 illustrates an example of the overall flow of a MapReduce operation, which goes through the following sequence of actions:
1. The MapReduce library splits the input data into M pieces and starts up many instances of the program on a cluster of machines.
2. One of the instances of the program is elected to be the master copy, while the rest are workers that are assigned their work by the master copy. In particular, there are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each of them one or more map tasks and/or reduce tasks.
3. A worker that is assigned a map task reads the contents of the corresponding input split, parses key/value pairs out of the input data, and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
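As noted above, the partitioning of the intermediate key space can be expressed as a small user-supplied class. The following is a minimal sketch equivalent to the default hash(key) mod R behavior; it mirrors Hadoop's built-in HashPartitioner, and the class name ModPartitioner and the key/value types are only illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each intermediate key to one of the R reduce partitions.
    public class ModPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask off the sign bit so the index is non-negative,
            // then take the hash modulo R (numPartitions).
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

A job would select R and this partitioner with job.setNumReduceTasks(R) and job.setPartitionerClass(ModPartitioner.class).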
¹ In the rest of this chapter, we use the two names, MapReduce and Hadoop, interchangeably.
² http://wiki.apache.org/hadoop/PoweredBy