MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

u , then the pair ( d, u ) is replaced by ( d, w + v ), and if there was no pair ( d, u ), then the pair

( d, w + v ) is stored at the node a . Also, the other nodes are sent the message ( a, d, w + v ) in

either of these two cases.

□

Computations in Pregel are organized into supersteps . In one superstep, all the messages

that were received by any of the nodes at the previous superstep (or initially, if it is the first

superstep) are processed, and then all the messages generated by those nodes are sent to

their destination.

In case of a compute-node failure, there is no attempt to restart the failed tasks at that

compute node. Rather, Pregel checkpoints its entire computation after some of the super-

steps. A checkpoint consists of making a copy of the entire state of each task, so it can be

restarted from that point if necessary. If any compute node fails, the entire job is restarted

from the most recent checkpoint.

Although this recovery strategy causes many tasks that have not failed to redo their

work, it is satisfactory in many situations. Recall that the reason MapReduce systems sup-

port restart of only the failed tasks is that we want assurance that the expected time to com-

plete the entire job in the face of failures is not too much greater than the time to run the

job with no failures. Any failure-management system will have that property as long as the

time to recover from a failure is much less than the average time between failures. Thus, it

is only necessary that Pregel checkpoints its computation after a number of supersteps such

that the probability of a failure during that number of supersteps is low.

2.4.4

Exercises for Section 2.4

! EXERCISE 2.4.1 Suppose a job consists of n tasks, each of which takes time t seconds.

Thus, if there are no failures, the sum over all compute nodes of the time taken to execute

tasks at that node is nt . Suppose also that the probability of a task failing is p per job per

second, and when a task fails, the overhead of management of the restart is such that it adds

10 t seconds to the total execution time of the job. What is the total expected execution time

of the job?

! EXERCISE 2.4.2 Suppose a Pregel job has a probability p of a failure during any superstep.

Suppose also that the execution time (summed over all compute nodes) of taking a check-

point is c times the time it takes to execute a superstep. To minimize the expected execution

time of the job, how many supersteps should elapse between checkpoints?

Search WWH ::

Custom Search

Home