Database Reference
In-Depth Information
The map tasks scheduled on the cluster nodes will then process these splits in a
distributed manner. A map task with a map function processes a set of KVs, pro-
duces a new set of intermediate KVs, and ends its life cycle when its assigned input
KVs are processed. In the shuffle phase, the map tasks send their output KVs to the
reduce tasks. An all-to-all communication between map tasks and reduce tasks is
performed. Finally, in the reduce phase, each reduce task with a reduce function
integrates the received intermediate KVs, produces the final result KVs, and ends
its life cycle when its received intermediate KVs are processed. A MapReduce job
returns when all the map tasks and reduce tasks have completed. The final output
data containing the result KVs is written to DFS, which can be accessed by users.
3.1.2 e XamPles oF i imPlementing i iterative a lgorithms in m aP r eDuCe
To implement iterative computations in MapReduce, users have to submit a series of
MapReduce jobs. One iteration corresponds to one or more MapReduce jobs. The
previous iteration job's output is feed to the next iteration job as input. In this sec-
tion, three iterative algorithm examples and their MapReduce implementations are
provided.
3.1.2.1 PageRank
PageRank [1] is a popular algorithm initially proposed for ranking web pages. Later
on, it has been used in a wide range of applications, such as link prediction and rec-
ommendation systems.
The PageRank vector R is defined over a directed graph G = ( V , E ). Each node v
in the graph is associated with a PageRank score R ( v ). The initial rank of each node
is
1
V
. Each node v updates its rank score iteratively as follows:
()
k
1
d
dR u
Nu
()
()
(
k
+
1
)
Rv
()
=
+
,
(1.1)
V
+
uN v
()
where N ( v ) is the set of nodes pointing to node v , N + ( v ) is the set of nodes that v
points to, k is the iteration number, and d is a constant representing the damping
factor. This iterative process continues for a fixed number of iterations or until the
difference between the resulting PageRank scores of two consecutive iterations is
smaller than a threshold.
In MapReduce, the map function is applied on each node u , where the input key is
the node id and the input value contains node u 's ranking score R ( u ) as well as node
u 's outbound neighbors set N + ( u ). The mapper on node u derives the partial ranking
Ru
Nu
()
()
score of v , v N + ( u ), that is, d
, that will be shuffled to node v . Meanwhile,
+
the retained PageRank score 1− d
V
and the outbound neighbors set N + ( u ) are shuffled
to itself. The reducer on node v accumulates these partial ranking scores and the
retained ranking score to produce a new ranking score of v . The updated ranking
Search WWH ::




Custom Search