[Figure: timelines of two processes, P and Q, starting from an initial state; local checkpoints are marked on each timeline, messages m1 and m2 are exchanged between Q and P, and a failure occurs at P. The two local checkpoints at P and Q jointly form a distributed checkpoint; D1 is labeled a recovery line, D2 is not.]

FIGURE 1.16 Demonstrating distributed checkpointing. D1 is a valid distributed checkpoint while D2 is not, due to being inconsistent. Specifically, D2's checkpoint at Q indicates that m2 has been received, while D2's checkpoint at P does not indicate that m2 has been sent.
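The consistency condition illustrated by Figure 1.16 can be sketched in a few lines: a distributed checkpoint is consistent only if every message some local checkpoint records as received is also recorded as sent by another. The following is a minimal illustration, not code from the text; the dictionary layout and message directions are assumptions for demonstration.

```python
# Sketch of the consistency test behind Figure 1.16. Each local
# checkpoint records which messages the process had sent and received
# when the checkpoint was taken.

def is_consistent(checkpoints):
    """checkpoints: dict mapping process name -> {'sent': set, 'received': set}."""
    all_sent = set()
    for state in checkpoints.values():
        all_sent |= state['sent']
    # An orphan message (recorded as received but never as sent)
    # makes the global state inconsistent.
    return all(state['received'] <= all_sent for state in checkpoints.values())

# D1 (assumed layout): P's checkpoint records sending m1, Q's records
# receiving it -> consistent, hence a recovery line.
d1 = {'P': {'sent': {'m1'}, 'received': set()},
     'Q': {'sent': set(), 'received': {'m1'}}}

# D2: Q's checkpoint records receiving m2, but P's checkpoint does not
# record sending it -> inconsistent, exactly the case in the caption.
d2 = {'P': {'sent': {'m1'}, 'received': set()},
     'Q': {'sent': set(), 'received': {'m1', 'm2'}}}
```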
[Figure: processes P and Q rolling back through successive local checkpoints after a failure at P; the candidate distributed checkpoints D1, D2, and D3 are each labeled "Not a recovery line".]

FIGURE 1.17 The domino effect that might result from rolling back each process (e.g., processes P and Q) to a saved local checkpoint in order to locate a recovery line. Neither D1, D2, nor D3 is a recovery line because they exhibit inconsistent global states.
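The rollback search that produces the domino effect can be sketched as follows. This is a hypothetical two-process illustration, not an algorithm from the text: whenever the current checkpoint pair contains an orphan message, the receiving process rolls back one more checkpoint, which may in turn orphan a message at the other process, cascading toward the initial state.

```python
# Sketch of locating a recovery line for two processes P and Q.
# Each argument is a list of local checkpoints, oldest first; each
# checkpoint records the 'sent' and 'received' message sets so far.

def find_recovery_line(checkpoints_p, checkpoints_q):
    i, j = len(checkpoints_p) - 1, len(checkpoints_q) - 1
    while i >= 0 and j >= 0:
        p, q = checkpoints_p[i], checkpoints_q[j]
        sent = p['sent'] | q['sent']
        if p['received'] <= sent and q['received'] <= sent:
            return i, j  # consistent pair: a recovery line
        # An orphan message forces the receiver to roll back further;
        # the domino effect is this branch firing repeatedly.
        if not q['received'] <= sent:
            j -= 1
        else:
            i -= 1
    return None

# Assumed scenario: every later checkpoint pair is inconsistent, so the
# search dominoes all the way back to the (empty) initial state.
cp_p = [{'sent': set(), 'received': set()},
        {'sent': {'m1'}, 'received': {'m2'}}]
cp_q = [{'sent': {'m2'}, 'received': {'m1', 'm3'}}]
cp_q = [{'sent': set(), 'received': set()}] + cp_q
```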
distributed checkpoint in Figure 1.17 is indeed inconsistent. This makes distributed checkpointing a costly operation that might not converge to an acceptable recovery solution. Accordingly, many fault-tolerant distributed systems combine checkpointing with message logging. One way to achieve this is to have each process log its messages before sending them off, once a checkpoint has been taken. Clearly, this solves the problem of D2 in Figure 1.16, for example. In particular, after D2's checkpoint at P is taken, the send of m2 will be recorded in a message log at P which, if merged with D2's checkpoint at Q, can form a consistent global state. Apart from distributed analytics engines, the Hadoop Distributed File System (HDFS) combines distributed checkpointing (i.e., the image file) and message logging (i.e., the edit file) to recover from master (or NameNode, in HDFS's parlance) failures [27]. As examples from analytics engines, distributed checkpointing alone is adopted by GraphLab, while message logging and distributed checkpointing together are employed by Pregel.
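The combination of checkpointing and message logging described above can be sketched as follows. This is an illustrative toy (the class and method names are invented, not from HDFS or the text): the checkpoint captures the send record up to some point, each later send is logged to stable storage before the message leaves, and recovery merges the two, so that a message like m2 in Figure 1.16 is never an orphan.

```python
# Sketch of a process that logs outgoing messages taken after its
# last checkpoint, so a crash cannot lose the record of a send.

class LoggingProcess:
    def __init__(self, name):
        self.name = name
        self.sent = set()       # volatile state, lost on a crash
        self.checkpoint = set() # stable snapshot of 'sent'
        self.log = []           # stable log of sends after the checkpoint

    def take_checkpoint(self):
        self.checkpoint = set(self.sent)
        self.log = []           # the log restarts at each checkpoint

    def send(self, msg):
        self.log.append(msg)    # log BEFORE the message leaves
        self.sent.add(msg)

    def recover(self):
        # Checkpoint merged with the log rebuilds the pre-failure
        # send set, yielding a consistent global state.
        self.sent = set(self.checkpoint) | set(self.log)

# Usage: P checkpoints, then sends m2 and crashes; the checkpoint and
# log together still prove that m2 was sent.
p = LoggingProcess('P')
p.send('m1')
p.take_checkpoint()
p.send('m2')
p.sent = set()   # simulate a crash wiping volatile state
p.recover()
```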