Database Reference
In-Depth Information
without need to store the temporary file that is output of one MapReduce job in the dis-
tributed file system. By locating tasks at compute nodes that have a copy of their input, we
can avoid much of the communication that would be necessary if we stored the result of
one MapReduce job and then initiated a second MapReduce job (although Hadoop and oth-
er MapReduce systems also try to locate Map tasks where a copy of their input is already
present).
2.4.2
Recursive Extensions to MapReduce
Many large-scale computations are really recursions. An important example is PageRank,
which is the subject of Chapter 5 . That computation is, in simple terms, the computation
of the fixedpoint of a matrix-vector multiplication. It is computed under MapReduce sys-
tems by the iterated application of the matrix-vector multiplication algorithm described in
Section 2.3.1 , or by a more complex strategy that we shall introduce in Section 5.2 . The it-
eration typically continues for an unknown number of steps, each step being a MapReduce
job, until the results of two consecutive iterations are sufficiently close that we believe con-
vergence has occurred.
The reason recursions are normally implemented by iterated MapReduce jobs is that a
true recursive task does not have the property necessary for independent restart of failed
tasks. It is impossible for a collection of mutually recursive tasks, each of which has an
output that is input to at least some of the other tasks, to produce output only at the end of
the task. If they all followed that policy, no task would ever receive any input, and noth-
ing could be accomplished. As a result, some mechanism other than simple restart of failed
tasks must be implemented in a system that handles recursive workflows (flow graphs that
are not acyclic). We shall start by studying an example of a recursion implemented as a
workflow, and then discuss approaches to dealing with task failures.
EXAMPLE 2.6 Suppose we have a directed graph whose arcs are represented by the relation
E ( X, Y ), meaning that there is an arc from node X to node Y . We wish to compute the paths
relation P ( X, Y ), meaning that there is a path of length 1 or more from node X to node Y . A
simple recursive algorithm to do so is:
(1) Start with P ( X, Y ) = E ( X, Y ).
(2) While changes to the relation P occur, add to P all tuples in ()
π X , Y P ( X, Z ) P ( Z, Y )
That is, find pairs of nodes X and Y such that for some node Z there is known to be a
path from X to Z and also a path from Z to Y .
Search WWH ::




Custom Search