Database Reference
In-Depth Information
Pregel and Giraph
Like MapReduce, Pregel was developed originally at Google. Also like MapReduce, there is an Apache, open-source
equivalent, called Giraph.
In Example 2.6 it is not essential to have two kinds of tasks. Rather, Join tasks could
eliminate duplicates as they are received, since they must store their previously received
inputs anyway. However, this arrangement has an advantage when we must recover from
a task failure. If each task stores all the output files it has ever created, and we place Join
tasks on different racks from the Dup-elim tasks, then we can deal with any single compute
node or single rack failure. That is, a Join task needing to be restarted can get all the previ-
ously generated inputs that it needs from the Dup-elim tasks, and vice-versa.
In the particular case of computing transitive closure, it is not necessary to prevent a
restarted task from generating outputs that the original task generated previously. In the
computation of the transitive closure, the rediscovery of a path does not influence the even-
tual answer. However, many computations cannot tolerate a situation where both the ori-
ginal and restarted versions of a task pass the same output to another task. For example, if
the final step of the computation were an aggregation, say a count of the number of nodes
reached by each node in the graph, then we would get the wrong answer if we counted a
path twice. In such a case, the master controller can record what files each task generated
and passed to other tasks. It can then restart a failed task and ignore those files when the
restarted version produces them a second time.
2.4.3
Pregel
Another approach to managing failures when implementing recursive algorithms on a com-
puting cluster is represented by the Pregel system. This system views its data as a graph.
Each node of the graph corresponds roughly to a task (although in practice many nodes of a
large graph would be bundled into a single task, as in the Join tasks of Example 2.6 ) . Each
graph node generates output messages that are destined for other nodes of the graph, and
each graph node processes the inputs it receives from other nodes.
EXAMPLE 2.7 Suppose our data is a collection of weighted arcs of a graph, and we want to
find, for each node of the graph, the length of the shortest path to each of the other nodes.
Initially, each graph node a stores the set of pairs ( b, w ) such that there is an arc from a to
b of weight w . These facts are initially sent to all other nodes, as triples ( a, b, w ). 6 When
the node a receives a triple ( c, d, w ), it looks up its current distance to c ; that is, it finds the
pair ( c, v ) stored locally, if there is one. It also finds the pair ( d, u ) if there is one. If w + v <
Search WWH ::




Custom Search