MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

Pregel and Giraph

Like MapReduce, Pregel was developed originally at Google. Also like MapReduce, there is an Apache, open-source

equivalent, called Giraph.

In Example 2.6 it is not essential to have two kinds of tasks. Rather, Join tasks could

eliminate duplicates as they are received, since they must store their previously received

inputs anyway. However, this arrangement has an advantage when we must recover from

a task failure. If each task stores all the output files it has ever created, and we place Join

tasks on different racks from the Dup-elim tasks, then we can deal with any single compute

node or single rack failure. That is, a Join task needing to be restarted can get all the previ-

ously generated inputs that it needs from the Dup-elim tasks, and vice-versa.

In the particular case of computing transitive closure, it is not necessary to prevent a

restarted task from generating outputs that the original task generated previously. In the

computation of the transitive closure, the rediscovery of a path does not influence the even-

tual answer. However, many computations cannot tolerate a situation where both the ori-

ginal and restarted versions of a task pass the same output to another task. For example, if

the final step of the computation were an aggregation, say a count of the number of nodes

reached by each node in the graph, then we would get the wrong answer if we counted a

path twice. In such a case, the master controller can record what files each task generated

and passed to other tasks. It can then restart a failed task and ignore those files when the

restarted version produces them a second time.

2.4.3

Pregel

Another approach to managing failures when implementing recursive algorithms on a com-

puting cluster is represented by the Pregel system. This system views its data as a graph.

Each node of the graph corresponds roughly to a task (although in practice many nodes of a

large graph would be bundled into a single task, as in the Join tasks of Example 2.6 ) . Each

graph node generates output messages that are destined for other nodes of the graph, and

each graph node processes the inputs it receives from other nodes.

EXAMPLE 2.7 Suppose our data is a collection of weighted arcs of a graph, and we want to

find, for each node of the graph, the length of the shortest path to each of the other nodes.

Initially, each graph node a stores the set of pairs ( b, w ) such that there is an arc from a to

b of weight w . These facts are initially sent to all other nodes, as triples ( a, b, w ). 6 When

the node a receives a triple ( c, d, w ), it looks up its current distance to c ; that is, it finds the

pair ( c, v ) stored locally, if there is one. It also finds the pair ( d, u ) if there is one. If w + v <

Search WWH ::

Custom Search

Home