FIGURE 1.16 Demonstrating distributed checkpointing. The two local checkpoints at P and Q jointly form a distributed checkpoint. D1 is a valid distributed checkpoint, while D2 is not, due to being inconsistent. Specifically, D2's checkpoint at Q indicates that m2 has been received, while D2's checkpoint at P does not indicate that m2 has been sent.
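To make the consistency condition in the caption concrete, the following minimal sketch (in Python; the names LocalCheckpoint and is_recovery_line are illustrative and not taken from the text) models each local checkpoint as the sets of message identifiers the process has recorded as sent and received, and accepts a distributed checkpoint as a recovery line only if no message is recorded as received without its send also being recorded:

from dataclasses import dataclass, field

@dataclass
class LocalCheckpoint:
    process: str
    sent: set = field(default_factory=set)      # message ids recorded as sent
    received: set = field(default_factory=set)  # message ids recorded as received

def is_recovery_line(checkpoints):
    # Consistent iff every message recorded as received is also recorded as sent.
    all_sent = set().union(*(c.sent for c in checkpoints))
    all_received = set().union(*(c.received for c in checkpoints))
    return all_received <= all_sent

# A consistent cut in the spirit of D1: m1's send and receipt are both recorded.
d1 = [LocalCheckpoint("Q", sent={"m1"}), LocalCheckpoint("P", received={"m1"})]
# D2 of Figure 1.16: Q has recorded m2 as received, but P has not recorded its send.
d2 = [LocalCheckpoint("P"), LocalCheckpoint("Q", received={"m2"})]

print(is_recovery_line(d1))  # True
print(is_recovery_line(d2))  # False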
FIGURE 1.17 The domino effect that might result from rolling back each process (e.g., processes P and Q) to a saved local checkpoint in order to locate a recovery line. None of D1, D2, or D3 is a recovery line, because they all exhibit inconsistent global states.
Each distributed checkpoint in Figure 1.17 is indeed inconsistent. This makes distributed checkpointing a costly operation that might not converge to an acceptable recovery solution. Accordingly, many fault-tolerant distributed systems combine checkpointing with message logging. One way to achieve this is to have each process log its outgoing messages before sending them off, once a checkpoint has been taken. Clearly, this solves the problem of D2 in Figure 1.16: after D2's checkpoint at P is taken, the send of m2 is recorded in P's message log, which, if merged with D2's checkpoint at Q, yields a consistent global state. Apart from distributed analytics engines, the Hadoop Distributed File System (HDFS) combines distributed checkpointing (i.e., the fsimage file) and message logging (i.e., the edit log) to recover from master (or NameNode, in HDFS's parlance) failures [27]. As examples from analytics engines, GraphLab adopts distributed checkpointing alone, while Pregel employs message logging combined with distributed checkpointing.
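The following sketch illustrates the sender-side combination described above. It is a minimal, assumed illustration in Python, not code from HDFS, GraphLab, or Pregel, and names such as Process, checkpoint, and recover are hypothetical. Every outgoing message is appended to a durable log before it is sent, the log is truncated whenever a checkpoint is taken, and recovery restores the checkpoint and then merges the logged sends back in:

import json, os

class Process:
    def __init__(self, name, log_path):
        self.name = name
        self.state = {}                        # application state
        self.sent = []                         # ids of messages known to be sent
        self.log_path = log_path               # message log kept since the last checkpoint
        self.ckpt_path = log_path + ".ckpt"    # local checkpoint file

    def checkpoint(self):
        # Persist the local state, then start a fresh log for later sends.
        with open(self.ckpt_path, "w") as f:
            json.dump({"state": self.state, "sent": self.sent}, f)
        open(self.log_path, "w").close()

    def send(self, msg_id, payload, dest):
        # Log the message before sending it off, as described in the text.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"id": msg_id, "payload": payload}) + "\n")
        self.sent.append(msg_id)
        dest.receive(msg_id, payload)

    def receive(self, msg_id, payload):
        self.state[msg_id] = payload

    def recover(self):
        # Restore the checkpoint and replay the message log: sends performed
        # after the checkpoint (e.g., m2 at P) are merged back in, keeping the
        # recovered state consistent with the receiver's checkpoint.
        with open(self.ckpt_path) as f:
            snap = json.load(f)
        self.state, self.sent = snap["state"], snap["sent"]
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                self.sent.extend(json.loads(line)["id"] for line in f)

Under this model, P can take a checkpoint and then send m2; even if P fails afterwards, recover() reconstructs the fact that m2 was sent, so Q's checkpoint, which already records m2 as received, no longer orphans the message.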