MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

without need to store the temporary file that is output of one MapReduce job in the dis-

tributed file system. By locating tasks at compute nodes that have a copy of their input, we

can avoid much of the communication that would be necessary if we stored the result of

one MapReduce job and then initiated a second MapReduce job (although Hadoop and oth-

er MapReduce systems also try to locate Map tasks where a copy of their input is already

present).

2.4.2

Recursive Extensions to MapReduce

Many large-scale computations are really recursions. An important example is PageRank,

which is the subject of Chapter 5 . That computation is, in simple terms, the computation

of the fixedpoint of a matrix-vector multiplication. It is computed under MapReduce sys-

tems by the iterated application of the matrix-vector multiplication algorithm described in

Section 2.3.1 , or by a more complex strategy that we shall introduce in Section 5.2 . The it-

eration typically continues for an unknown number of steps, each step being a MapReduce

job, until the results of two consecutive iterations are sufficiently close that we believe con-

vergence has occurred.

The reason recursions are normally implemented by iterated MapReduce jobs is that a

true recursive task does not have the property necessary for independent restart of failed

tasks. It is impossible for a collection of mutually recursive tasks, each of which has an

output that is input to at least some of the other tasks, to produce output only at the end of

the task. If they all followed that policy, no task would ever receive any input, and noth-

ing could be accomplished. As a result, some mechanism other than simple restart of failed

tasks must be implemented in a system that handles recursive workflows (flow graphs that

are not acyclic). We shall start by studying an example of a recursion implemented as a

workflow, and then discuss approaches to dealing with task failures.

EXAMPLE 2.6 Suppose we have a directed graph whose arcs are represented by the relation

E ( X, Y ), meaning that there is an arc from node X to node Y . We wish to compute the paths

relation P ( X, Y ), meaning that there is a path of length 1 or more from node X to node Y . A

simple recursive algorithm to do so is:

(1) Start with P ( X, Y ) = E ( X, Y ).

(2) While changes to the relation P occur, add to P all tuples in ()

π X , Y P ( X, Z ) ⋈ P ( Z, Y )

That is, find pairs of nodes X and Y such that for some node Z there is known to be a

path from X to Z and also a path from Z to Y .

Search WWH ::

Custom Search

Home