related to a particular key, outputting a set of merged output values (usually
just one). MapReduce is often explained by illustrating a possible solution to the problem of counting the number of occurrences of each word in a large collection of documents. The following pseudocode shows the two functions that need to be implemented:
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
The map function emits each word together with an associated count of occurrences (in this simple example, just one). The reduce function produces the required result by summing all of the counts emitted for a specific word. MapReduce implementations (e.g., Google App Engine and Apache Hadoop) then automatically parallelize and execute the program on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
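For concreteness, the following is a minimal single-process Python sketch of this word count. The distributed runtime is replaced by an in-memory dictionary that stands in for the shuffle and sort phase, and the names map_fn, reduce_fn, and word_count are illustrative rather than part of any real MapReduce API:

from collections import defaultdict

# Illustrative single-process sketch; not a real MapReduce runtime.
def map_fn(input_key, input_value):
    # input_key: document name; input_value: document contents.
    for word in input_value.split():
        yield (word, 1)

def reduce_fn(output_key, intermediate_values):
    # output_key: a word; intermediate_values: its list of counts.
    return output_key, sum(intermediate_values)

def word_count(documents):
    # Stand-in for the shuffle and sort phase: group every
    # intermediate value under its key.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            grouped[key].append(value)
    # Reduce phase: merge the values associated with each key.
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))

docs = {"doc1": "the quick brown fox", "doc2": "the dog and the fox"}
print(word_count(docs))
# {'and': 1, 'brown': 1, 'dog': 1, 'fox': 2, 'quick': 1, 'the': 3}

Running the sketch on the two sample documents prints each distinct word with its total count; in a real deployment the same map and reduce logic would run in parallel across many machines.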
The programming model for the MapReduce architecture is a simple abstraction in which the computation takes a set of input key-value pairs associated with the input data and produces a set of output key-value pairs. The overall
model for this process is shown in Figure 17.1. In the map phase, the input
data are partitioned into input splits and assigned to map tasks associated
with processing nodes in the cluster. The map task typically executes on the
same node containing its assigned partition of data in the cluster. These map
tasks perform user-specified computations on each input key-value pair
from the partition of input data assigned to the task and generate a set of
intermediate results for each key. The shuffle and sort phase then takes the intermediate data generated by each map task, merges and sorts these data with the intermediate data from other nodes, divides them into regions to be processed by the reduce tasks, and distributes them as needed to the nodes where the reduce tasks will execute. All map tasks must complete before the shuffle and sort and reduce phases can begin. The number of reduce tasks does not need to be the same as the number of map tasks. The reduce tasks perform additional user-specified operations on the intermediate data, possibly merging the values associated with a key into a smaller set of values, to produce the output data.
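To make the division into regions concrete: implementations commonly assign an intermediate pair to a region with a partitioning function such as hash(key) mod R, where R is the number of reduce tasks, so that every occurrence of a key reaches the same reduce task. The following single-process Python sketch is illustrative only; a stable CRC32 hash stands in for whatever hash function a real system would use:

import zlib

# Illustrative sketch of partitioning; not a real MapReduce API.
def partition(key, num_reduce_tasks):
    # hash(key) mod R: every occurrence of a key maps to the same
    # region, so a single reduce task sees all of its values.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def shuffle_and_sort(intermediate_pairs, num_reduce_tasks):
    # Divide the intermediate (key, value) pairs into one region per
    # reduce task, then sort each region by key so each reduce task
    # receives its keys in grouped, ordered form.
    regions = [[] for _ in range(num_reduce_tasks)]
    for key, value in intermediate_pairs:
        regions[partition(key, num_reduce_tasks)].append((key, value))
    return [sorted(region) for region in regions]

pairs = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]
for task, region in enumerate(shuffle_and_sort(pairs, 2)):
    print("reduce task", task, "receives", region)

Because the partitioning function is deterministic, both occurrences of "the" land in the same region, which is exactly what allows a single reduce task to sum all the counts for that word.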