KVs). Then, a series of MapReduce jobs update the state KVs based on the
structure KVs.
Use of State/Structure KVs. In each iteration, a mapper operating on a
structure KV produces intermediate KVs based on the state KVs, and a reducer
updates a state KV. For example, in PageRank, the mapper operating on
each node needs the node's PageRank score (state KV) and its outgoing neighbors
(structure KV), and the reducer updates a node's PageRank score (state KV);
in K-means, the mapper operating on each point needs the coordinates of all
centroids (state KVs) and the point's own coordinate (structure KV), and the reducer
updates a centroid's coordinate (state KV); in MPI, mapper 2, operating on
each column of M, needs the j-th column of N (state KV) and the j-th column of
M (structure KV), and reducer 2 updates a column of N (state KV).
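As a concrete illustration of the PageRank case, the sketch below simulates one map/reduce round in plain Python. This is a hypothetical simulation, not Hadoop code; the toy graph, the damping factor constant, and all function names are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical sketch (not Hadoop code) of one PageRank iteration.
# The mapper reads a structure KV (node -> out-neighbors) together with
# a state KV (node -> score); the reducer updates a state KV.
DAMPING = 0.85

def map_node(node, neighbors, score):
    # structure KV: (node, neighbors); state KV: (node, score)
    for dst in neighbors:
        yield dst, score / len(neighbors)

def reduce_node(node, contributions, num_nodes):
    # emits the updated state KV for this node
    return node, (1 - DAMPING) / num_nodes + DAMPING * sum(contributions)

def pagerank_iteration(structure, state):
    intermediate = defaultdict(list)
    for node, neighbors in structure.items():
        for dst, contrib in map_node(node, neighbors, state[node]):
            intermediate[dst].append(contrib)
    return dict(reduce_node(n, intermediate.get(n, []), len(structure))
                for n in structure)

# toy graph with made-up edges; every node has at least one out-neighbor
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = {n: 1 / len(graph) for n in graph}
for _ in range(10):
    scores = pagerank_iteration(graph, scores)
```

Note that the structure KVs (`graph`) never change across iterations, while the state KVs (`scores`) are rewritten by every job; this is exactly the asymmetry the text exploits.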
Mappings between Reducers and Mappers. The map function operates
on a structure KV and produces intermediate results based on one or more
state KVs. Accordingly, a structure KV can have different mappings to the
state KVs. For example, in PageRank, the mapper operates on a node's
PageRank score and the same node's outgoing neighbors, which is a "one-to-one"
mapping. In K-means, the mapper operates on a point and all centroids,
which is a "one-to-all" mapping. In MPI, mapper 1 operates on an
entry (i, j) of M and the entries of the j-th row of N, which is a "one-to-more"
mapping. To sum up, a mapper operates on a structure KV, and a reducer
updates a state KV; hence, the mappings between a reducer of the i-th job
and the mappers of the (i + 1)-th job can be different.
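The "one-to-all" mapping in K-means can be made concrete with a short sketch. This is a hypothetical Python simulation, not Hadoop code; the sample points, centroid IDs, and function names are invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical sketch (not Hadoop code) of the "one-to-all" mapping in
# K-means: every mapper operating on a point (structure KV) must read
# ALL centroid coordinates (state KVs); each reducer updates exactly
# one centroid (state KV).

def map_point(point, centroids):
    # one-to-all: this single mapper needs every state KV (all centroids)
    nearest = min(centroids, key=lambda c: math.dist(point, centroids[c]))
    yield nearest, point

def reduce_centroid(cid, points):
    # updates one state KV: the centroid's new coordinate (mean of members)
    n = len(points)
    return cid, tuple(sum(coord) / n for coord in zip(*points))

def kmeans_iteration(points, centroids):
    groups = defaultdict(list)
    for p in points:
        for cid, pt in map_point(p, centroids):
            groups[cid].append(pt)
    return dict(reduce_centroid(c, pts) for c, pts in groups.items())

# made-up data: two well-separated clusters and two initial centroids
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = {"c0": (0.0, 0.5), "c1": (10.0, 10.5)}
centroids = kmeans_iteration(points, centroids)
```

Contrast this with the PageRank mapper, which only ever touches the state KV of the same node it is operating on ("one-to-one").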
These observations capture the essential properties of iterative computations in
MapReduce. Programmers can first identify the state/structure KVs and then
define the iterative data flow to design their MapReduce programs. Moreover, these
properties guide the design of an efficient programming model for iterative
computations; iMapReduce is proposed based on these properties.
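The iterative data flow described above can be sketched as a generic driver loop. The skeleton below is a hypothetical Python simulation, not the Hadoop API; the `run_job`/`iterate` names and the toy averaging example are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical driver-loop skeleton (not the real Hadoop API) for the
# iterative data flow: the main loop runs one MapReduce "job" per
# iteration, feeding each job's output state KVs into the next job.

def run_job(map_fn, reduce_fn, structure, state):
    # one job: map over structure KVs (reading state KVs), then reduce
    # each intermediate key group into a new state KV
    intermediate = defaultdict(list)
    for key, value in structure.items():
        for out_key, out_val in map_fn(key, value, state):
            intermediate[out_key].append(out_val)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

def iterate(map_fn, reduce_fn, structure, state, num_iters):
    for _ in range(num_iters):  # a fresh job is started every iteration
        state = run_job(map_fn, reduce_fn, structure, state)
    return state

# toy example: each node sends its value to its neighbors, and each
# node's new state is the mean of the values it receives
structure = {"a": ["a", "b"], "b": ["a", "b"]}
state = {"a": 0.0, "b": 8.0}
final = iterate(
    lambda n, nbrs, st: [(d, st[n]) for d in nbrs],
    lambda n, vals: sum(vals) / len(vals),
    structure, state, 5)
```

The skeleton makes the cost structure visible: `structure` is passed to every job unchanged, while only `state` evolves, which is what motivates the efficiency discussion in the next subsection.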
3.1.4 Limited Support of Iterative Computations by MapReduce
As described above, MapReduce can be used to implement iterative algorithms.
However, MapReduce has limited support for iterative computations. We list three
performance penalties of implementing iterative algorithms in Hadoop MapReduce.
1. The operations in all iterations are the same. Nevertheless, the MapReduce
implementation starts one or more new jobs for each iteration, which involves
repeated task initializations and cleanups. Moreover, these jobs have to load
the input data from DFS and dump the output data to DFS repeatedly. Hadoop
is known to incur around 20 seconds of start-up overhead per job [7], so a
series of jobs results in unnecessary job startup overhead.
2. The map tasks in an iteration cannot start before all the reduce
tasks in the previous iteration have finished. The main loop in the MapReduce
implementation requires the completion of the previous iteration's job before
starting the next iteration's job. However, the map tasks should be started as soon as