Large-Scale File Systems and Map-Reduce - Mining of Massive Datasets

Databases Reference

In-Depth Information

If w + v < u, then the pair (d, u) is replaced by (d, w + v), and if there was

no pair (d, u), then the pair (d, w + v) is stored at the node a. Also, the other

nodes are sent the message (a, d, w + v) in either of these two cases.

2

Computations in Pregel are organized into supersteps. In one superstep, all

the messages that were received by any of the nodes at the previous superstep

(or initially, if it is the first superstep) are processed, and then all the messages

generated by those nodes are sent to their destination.

In case of a compute-node failure, there is no attempt to restart the failed

tasks at that compute node. Rather, Pregel checkpoints its entire computation

after some of the supersteps. A checkpoint consists of making a copy of the

entire state of each task, so it can be restarted from that point if necessary.

If any compute node fails, the entire job is restarted from the most recent

checkpoint.

Although this recovery strategy causes many tasks that have not failed to

redo their work, it is satisfactory in many situations. Recall that the reason

map-reduce systems support restart of only the failed tasks is that we want

assurance that the expected time to complete the entire job in the face of fail-

ures is not too much greater than the time to run the job with no failures.

Any failure-management system will have that property as long as the time

to recover from a failure is much less than the average time between failures.

Thus, it is only necessary that Pregel checkpoints its computation after a num-

ber of supersteps such that the probability of a failure during that number of

supersteps is low.

2.4.4 Exercises for Section 2.4

! Exercise 2.4.1 : Suppose a job consists of n tasks, each of which takes time t

seconds. Thus, if there are no failures, the sum over all compute nodes of the

time taken to execute tasks at that node is nt. Suppose also that the probability

of a task failing is p per job per second, and when a task fails, the overhead of

management of the restart is such that it adds 10t seconds to the total execution

time of the job. What is the total expected execution time of the job?

! Exercise 2.4.2 : Suppose a Pregel job has a probability p of a failure during

any superstep. Suppose also that the execution time (summed over all compute

nodes) of taking a checkpoint is c times the time it takes to execute a superstep.

To minimize the expected execution time of the job, how many supersteps

should elapse between checkpoints?

2.5

E ciency of Cluster-Computing Algorithms

In this section we shall introduce a model for measuring the quality of algorithms

implemented on a computing cluster of the type so far discussed in this chapter.

We assume the computation is described by an acyclic workflow, as discussed

Search WWH ::

Custom Search

Home