Hadoop* is an open-source Java library [131] that supports data-intensive distributed applications by implementing the MapReduce framework.
It has been widely used by a large number of business companies for production
purposes. On the implementation level, the map invocations of a MapReduce job are
distributed across multiple machines by automatically partitioning the input data into
a set of M splits. The input splits can be processed in parallel by different machines.
Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Figure 2.2 illustrates an example of the overall flow of a MapReduce operation, which goes through the following sequence of actions:
1. The input data of the MapReduce program is split into M pieces, and many instances of the program are started on a cluster of machines.
2. One instance of the program is elected to be the master copy, while the rest are workers that are assigned their work by the master copy. In particular, there are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each of them one or more map tasks and/or reduce tasks.
3. A worker that is assigned a map task processes the contents of the corresponding input split, generates key/value pairs from the input data, and passes each pair to the user-defined map function. The intermediate key/value pairs produced by the map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk and partitioned
into R regions by the partitioning function. The locations of these buffered
pairs on the local disk are passed back to the master, who is responsible for
forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it reads the buffered data from the local disks of the map workers and sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting operation is needed because typically many different keys map to the same reduce task.
6. The reduce worker passes the key and the corresponding set of intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master program wakes up the user program. At this point, the MapReduce invocation in the user program returns the program control back to the user code.
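The sequence of actions above can be sketched as a minimal, single-process simulation. This is an illustrative word-count job, not Hadoop's actual API: all function names here are hypothetical, and the cluster machinery (workers, master, local disks) is collapsed into plain in-memory data structures.

```python
from collections import defaultdict

def user_map(key, value):
    # User-defined map function: emit (word, 1) for each word in the split.
    for word in value.split():
        yield (word, 1)

def user_reduce(key, values):
    # User-defined reduce function: sum the counts for one word.
    return (key, sum(values))

def map_reduce(input_splits, R):
    # Steps 3-4: run the map function on each of the M splits and partition
    # the intermediate pairs into R regions via hash(key) mod R.
    regions = [defaultdict(list) for _ in range(R)]
    for split_id, split in enumerate(input_splits):
        for k, v in user_map(split_id, split):
            regions[hash(k) % R][k].append(v)  # partitioning function
    # Steps 5-6: each "reduce worker" sorts its region by intermediate key,
    # so all occurrences of the same key are grouped, then reduces each group.
    output = []
    for region in regions:
        for k in sorted(region):
            output.append(user_reduce(k, region[k]))
    return output

counts = dict(map_reduce(["the quick fox", "the lazy dog"], R=2))
```

Note that, as in step 6, each reduce partition produces its own run of output; the results are sorted within a partition but not globally across partitions.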
During the execution process, the master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Any map tasks marked as completed or in progress by the failed worker are reset back to their initial idle state and therefore become eligible for scheduling on other workers.
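The failure-detection rule just described can be sketched as follows. This is a hedged simplification, assuming a single timeout threshold and a flat dictionary of task states; it is not Hadoop's actual master implementation.

```python
import time

def check_workers(last_response, task_state, task_owner, timeout, now=None):
    """Mark workers that missed the ping deadline as failed and reset
    their map tasks to idle so they can be rescheduled elsewhere.

    last_response: worker -> timestamp of its last ping reply
    task_state:    map task -> "idle" | "in_progress" | "completed"
    task_owner:    map task -> worker it was assigned to
    """
    now = time.time() if now is None else now
    failed = {w for w, t in last_response.items() if now - t > timeout}
    for task, owner in task_owner.items():
        # Completed map tasks must also be redone on failure, because their
        # intermediate output lives on the failed worker's local disk.
        if owner in failed and task_state[task] in ("completed", "in_progress"):
            task_state[task] = "idle"
    return failed
```

For example, with a 20-second timeout, a worker last heard from 40 seconds ago is marked failed and its completed map task returns to the idle state, while tasks on responsive workers are untouched.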
* http://hadoop.apache.org/.
In the rest of this chapter, we use the two names, MapReduce and Hadoop, interchangeably.
http://wiki.apache.org/hadoop/PoweredBy.