Databases Reference
In-Depth Information
If w + v < u, then the pair (d, u) is replaced by (d, w + v), and if there was
no pair (d, u), then the pair (d, w + v) is stored at the node a. Also, the other
nodes are sent the message (a, d, w + v) in either of these two cases.
2
Computations in Pregel are organized into supersteps. In one superstep, all
the messages that were received by any of the nodes at the previous superstep
(or initially, if it is the first superstep) are processed, and then all the messages
generated by those nodes are sent to their destination.
In case of a compute-node failure, there is no attempt to restart the failed
tasks at that compute node. Rather, Pregel checkpoints its entire computation
after some of the supersteps. A checkpoint consists of making a copy of the
entire state of each task, so it can be restarted from that point if necessary.
If any compute node fails, the entire job is restarted from the most recent
checkpoint.
Although this recovery strategy causes many tasks that have not failed to
redo their work, it is satisfactory in many situations. Recall that the reason
map-reduce systems support restart of only the failed tasks is that we want
assurance that the expected time to complete the entire job in the face of fail-
ures is not too much greater than the time to run the job with no failures.
Any failure-management system will have that property as long as the time
to recover from a failure is much less than the average time between failures.
Thus, it is only necessary that Pregel checkpoints its computation after a num-
ber of supersteps such that the probability of a failure during that number of
supersteps is low.
2.4.4 Exercises for Section 2.4
! Exercise 2.4.1 : Suppose a job consists of n tasks, each of which takes time t
seconds. Thus, if there are no failures, the sum over all compute nodes of the
time taken to execute tasks at that node is nt. Suppose also that the probability
of a task failing is p per job per second, and when a task fails, the overhead of
management of the restart is such that it adds 10t seconds to the total execution
time of the job. What is the total expected execution time of the job?
! Exercise 2.4.2 : Suppose a Pregel job has a probability p of a failure during
any superstep. Suppose also that the execution time (summed over all compute
nodes) of taking a checkpoint is c times the time it takes to execute a superstep.
To minimize the expected execution time of the job, how many supersteps
should elapse between checkpoints?
2.5
E ciency of Cluster-Computing Algorithms
In this section we shall introduce a model for measuring the quality of algorithms
implemented on a computing cluster of the type so far discussed in this chapter.
We assume the computation is described by an acyclic workflow, as discussed
Search WWH ::




Custom Search