related to a particular key, outputting a set of merged output values (usually
just one). MapReduce is often explained by illustrating a possible solution to the problem of counting the number of occurrences of each word in a large collection of documents. The following pseudocode shows the two functions that need to be implemented:
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
The map function emits each word together with an associated count of occurrences (in this simple example, just one). The reduce function produces the required result by summing all of the counts emitted for a specific word. MapReduce implementations (e.g., Google App Engine and Apache Hadoop) then automatically parallelize and execute the program on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
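For concreteness, the following is a minimal single-process Python sketch of this word count. The distributed runtime is replaced by an in-memory dictionary that stands in for the shuffle and sort phase, and the names map_fn, reduce_fn, and word_count are illustrative rather than part of any real MapReduce API:

from collections import defaultdict

# Illustrative single-process sketch; not a real MapReduce runtime.
def map_fn(input_key, input_value):
    # input_key: document name; input_value: document contents.
    for word in input_value.split():
        yield (word, 1)

def reduce_fn(output_key, intermediate_values):
    # output_key: a word; intermediate_values: its list of counts.
    return output_key, sum(intermediate_values)

def word_count(documents):
    # Stand-in for the shuffle and sort phase: group every
    # intermediate value under its key.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            grouped[key].append(value)
    # Reduce phase: merge the values associated with each key.
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))

docs = {"doc1": "the quick brown fox", "doc2": "the dog and the fox"}
print(word_count(docs))
# {'and': 1, 'brown': 1, 'dog': 1, 'fox': 2, 'quick': 1, 'the': 3}

Running the sketch on the two sample documents prints each distinct word with its total count; in a real deployment the same map and reduce logic would run in parallel across many machines.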
The programming model for the MapReduce architecture is a simple abstraction in which the computation takes a set of input key-value pairs associated with the input data and produces a set of output key-value pairs. The overall
model for this process is shown in Figure 17.1. In the map phase, the input
data are partitioned into input splits and assigned to map tasks associated
with processing nodes in the cluster. The map task typically executes on the
same node containing its assigned partition of data in the cluster. These map
tasks perform user-specified computations on each input key-value pair
from the partition of input data assigned to the task and generate a set of
intermediate results for each key. The shuffle and sort phase then takes the intermediate data generated by each map task, merges and sorts these data with the intermediate data from other nodes, divides them into regions to be processed by the reduce tasks, and distributes them as needed to the nodes where the reduce tasks will execute. All map tasks must complete before the shuffle and sort and reduce phases can begin. The number of reduce tasks does not need to be the same as the number of map tasks. The reduce tasks perform additional user-specified operations on the intermediate data, possibly merging the values associated with a key into a smaller set of values, to produce the output data.
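To make the division into regions concrete: implementations commonly assign an intermediate pair to a region with a partitioning function such as hash(key) mod R, where R is the number of reduce tasks, so that every occurrence of a key reaches the same reduce task. The following single-process Python sketch is illustrative only; a stable CRC32 hash stands in for whatever hash function a real system would use:

import zlib

# Illustrative sketch of partitioning; not a real MapReduce API.
def partition(key, num_reduce_tasks):
    # hash(key) mod R: every occurrence of a key maps to the same
    # region, so a single reduce task sees all of its values.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def shuffle_and_sort(intermediate_pairs, num_reduce_tasks):
    # Divide the intermediate (key, value) pairs into one region per
    # reduce task, then sort each region by key so each reduce task
    # receives its keys in grouped, ordered form.
    regions = [[] for _ in range(num_reduce_tasks)]
    for key, value in intermediate_pairs:
        regions[partition(key, num_reduce_tasks)].append((key, value))
    return [sorted(region) for region in regions]

pairs = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]
for task, region in enumerate(shuffle_and_sort(pairs, 2)):
    print("reduce task", task, "receives", region)

Because the partitioning function is deterministic, both occurrences of "the" land in the same region, which is exactly what allows a single reduce task to sum all the counts for that word.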