iMapReduce - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

The map tasks scheduled on the cluster nodes will then process these splits in a

distributed manner. A map task with a map function processes a set of KVs, pro-

duces a new set of intermediate KVs, and ends its life cycle when its assigned input

KVs are processed. In the shuffle phase, the map tasks send their output KVs to the

reduce tasks. An all-to-all communication between map tasks and reduce tasks is

performed. Finally, in the reduce phase, each reduce task with a reduce function

integrates the received intermediate KVs, produces the final result KVs, and ends

its life cycle when its received intermediate KVs are processed. A MapReduce job

returns when all the map tasks and reduce tasks have completed. The final output

data containing the result KVs is written to DFS, which can be accessed by users.

3.1.2 e XamPles oF i imPlementing i iterative a lgorithms in m aP r eDuCe

To implement iterative computations in MapReduce, users have to submit a series of

MapReduce jobs. One iteration corresponds to one or more MapReduce jobs. The

previous iteration job's output is feed to the next iteration job as input. In this sec-

tion, three iterative algorithm examples and their MapReduce implementations are

provided.

3.1.2.1 PageRank

PageRank [1] is a popular algorithm initially proposed for ranking web pages. Later

on, it has been used in a wide range of applications, such as link prediction and rec-

ommendation systems.

The PageRank vector R is defined over a directed graph G = ( V , E ). Each node v

in the graph is associated with a PageRank score R ( v ). The initial rank of each node

is

1

V

. Each node v updates its rank score iteratively as follows:

()

k

1

−

d

dR u

Nu

()

∑

(

k

+

1

)

Rv

()

=

+

,

(1.1)

V

+

−

uN v

∈

()

where N − ( v ) is the set of nodes pointing to node v , N + ( v ) is the set of nodes that v

points to, k is the iteration number, and d is a constant representing the damping

factor. This iterative process continues for a fixed number of iterations or until the

difference between the resulting PageRank scores of two consecutive iterations is

smaller than a threshold.

In MapReduce, the map function is applied on each node u , where the input key is

the node id and the input value contains node u 's ranking score R ( u ) as well as node

u 's outbound neighbors set N + ( u ). The mapper on node u derives the partial ranking

Ru

Nu

()

score of v , v ∈ N + ( u ), that is, d

, that will be shuffled to node v . Meanwhile,

+

the retained PageRank score 1− d

V

and the outbound neighbors set N + ( u ) are shuffled

to itself. The reducer on node v accumulates these partial ranking scores and the

retained ranking score to produce a new ranking score of v . The updated ranking

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home