Database Reference
In-Depth Information
u , then the pair ( d, u ) is replaced by ( d, w + v ), and if there was no pair ( d, u ), then the pair
( d, w + v ) is stored at the node a . Also, the other nodes are sent the message ( a, d, w + v ) in
either of these two cases.
Computations in Pregel are organized into supersteps . In one superstep, all the messages
that were received by any of the nodes at the previous superstep (or initially, if it is the first
superstep) are processed, and then all the messages generated by those nodes are sent to
their destination.
In case of a compute-node failure, there is no attempt to restart the failed tasks at that
compute node. Rather, Pregel checkpoints its entire computation after some of the super-
steps. A checkpoint consists of making a copy of the entire state of each task, so it can be
restarted from that point if necessary. If any compute node fails, the entire job is restarted
from the most recent checkpoint.
Although this recovery strategy causes many tasks that have not failed to redo their
work, it is satisfactory in many situations. Recall that the reason MapReduce systems sup-
port restart of only the failed tasks is that we want assurance that the expected time to com-
plete the entire job in the face of failures is not too much greater than the time to run the
job with no failures. Any failure-management system will have that property as long as the
time to recover from a failure is much less than the average time between failures. Thus, it
is only necessary that Pregel checkpoints its computation after a number of supersteps such
that the probability of a failure during that number of supersteps is low.
2.4.4
Exercises for Section 2.4
! EXERCISE 2.4.1 Suppose a job consists of n tasks, each of which takes time t seconds.
Thus, if there are no failures, the sum over all compute nodes of the time taken to execute
tasks at that node is nt . Suppose also that the probability of a task failing is p per job per
second, and when a task fails, the overhead of management of the restart is such that it adds
10 t seconds to the total execution time of the job. What is the total expected execution time
of the job?
! EXERCISE 2.4.2 Suppose a Pregel job has a probability p of a failure during any superstep.
Suppose also that the execution time (summed over all compute nodes) of taking a check-
point is c times the time it takes to execute a superstep. To minimize the expected execution
time of the job, how many supersteps should elapse between checkpoints?
Search WWH ::




Custom Search