[Figure: timelines of two processes, P and Q, starting from an initial state; local checkpoints are marked on each timeline, messages m1 and m2 are exchanged between Q and P, and a failure occurs at P. The two local checkpoints at P and Q jointly form a distributed checkpoint; D1 is labeled a recovery line, D2 is not.]

FIGURE 1.16 Demonstrating distributed checkpointing. D1 is a valid distributed checkpoint while D2 is not, due to being inconsistent. Specifically, D2's checkpoint at Q indicates that m2 has been received, while D2's checkpoint at P does not indicate that m2 has been sent.
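The consistency condition illustrated by Figure 1.16 can be sketched in a few lines: a distributed checkpoint is consistent only if every message some local checkpoint records as received is also recorded as sent by another. The following is a minimal illustration, not code from the text; the dictionary layout and message directions are assumptions for demonstration.

```python
# Sketch of the consistency test behind Figure 1.16. Each local
# checkpoint records which messages the process had sent and received
# when the checkpoint was taken.

def is_consistent(checkpoints):
    """checkpoints: dict mapping process name -> {'sent': set, 'received': set}."""
    all_sent = set()
    for state in checkpoints.values():
        all_sent |= state['sent']
    # An orphan message (recorded as received but never as sent)
    # makes the global state inconsistent.
    return all(state['received'] <= all_sent for state in checkpoints.values())

# D1 (assumed layout): P's checkpoint records sending m1, Q's records
# receiving it -> consistent, hence a recovery line.
d1 = {'P': {'sent': {'m1'}, 'received': set()},
     'Q': {'sent': set(), 'received': {'m1'}}}

# D2: Q's checkpoint records receiving m2, but P's checkpoint does not
# record sending it -> inconsistent, exactly the case in the caption.
d2 = {'P': {'sent': {'m1'}, 'received': set()},
     'Q': {'sent': set(), 'received': {'m1', 'm2'}}}
```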
[Figure: processes P and Q rolling back through successive local checkpoints after a failure at P; the candidate distributed checkpoints D1, D2, and D3 are each labeled "Not a recovery line".]

FIGURE 1.17 The domino effect that might result from rolling back each process (e.g., processes P and Q) to a saved local checkpoint in order to locate a recovery line. Neither D1, D2, nor D3 is a recovery line because they exhibit inconsistent global states.
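The rollback search that produces the domino effect can be sketched as follows. This is a hypothetical two-process illustration, not an algorithm from the text: whenever the current checkpoint pair contains an orphan message, the receiving process rolls back one more checkpoint, which may in turn orphan a message at the other process, cascading toward the initial state.

```python
# Sketch of locating a recovery line for two processes P and Q.
# Each argument is a list of local checkpoints, oldest first; each
# checkpoint records the 'sent' and 'received' message sets so far.

def find_recovery_line(checkpoints_p, checkpoints_q):
    i, j = len(checkpoints_p) - 1, len(checkpoints_q) - 1
    while i >= 0 and j >= 0:
        p, q = checkpoints_p[i], checkpoints_q[j]
        sent = p['sent'] | q['sent']
        if p['received'] <= sent and q['received'] <= sent:
            return i, j  # consistent pair: a recovery line
        # An orphan message forces the receiver to roll back further;
        # the domino effect is this branch firing repeatedly.
        if not q['received'] <= sent:
            j -= 1
        else:
            i -= 1
    return None

# Assumed scenario: every later checkpoint pair is inconsistent, so the
# search dominoes all the way back to the (empty) initial state.
cp_p = [{'sent': set(), 'received': set()},
        {'sent': {'m1'}, 'received': {'m2'}}]
cp_q = [{'sent': {'m2'}, 'received': {'m1', 'm3'}}]
cp_q = [{'sent': set(), 'received': set()}] + cp_q
```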
distributed checkpoint in Figure 1.17 is indeed inconsistent. This makes distributed checkpointing a costly operation that might not converge to an acceptable recovery solution. Accordingly, many fault-tolerant distributed systems combine checkpointing with message logging. One way to achieve this is to have each process log its messages before sending them off, once a checkpoint has been taken. Clearly, this solves the problem of D2 in Figure 1.16, for example. In particular, after D2's checkpoint at P is taken, the send of m2 will be recorded in a message log at P which, if merged with D2's checkpoint at Q, can form a consistent global state. Apart from distributed analytics engines, the Hadoop Distributed File System (HDFS) combines distributed checkpointing (i.e., the image file) and message logging (i.e., the edit file) to recover from master (or NameNode, in HDFS's parlance) failures [27]. As examples from analytics engines, distributed checkpointing alone is adopted by GraphLab, while message logging and distributed checkpointing together are employed by Pregel.
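The combination of checkpointing and message logging described above can be sketched as follows. This is an illustrative toy (the class and method names are invented, not from HDFS or the text): the checkpoint captures the send record up to some point, each later send is logged to stable storage before the message leaves, and recovery merges the two, so that a message like m2 in Figure 1.16 is never an orphan.

```python
# Sketch of a process that logs outgoing messages taken after its
# last checkpoint, so a crash cannot lose the record of a send.

class LoggingProcess:
    def __init__(self, name):
        self.name = name
        self.sent = set()       # volatile state, lost on a crash
        self.checkpoint = set() # stable snapshot of 'sent'
        self.log = []           # stable log of sends after the checkpoint

    def take_checkpoint(self):
        self.checkpoint = set(self.sent)
        self.log = []           # the log restarts at each checkpoint

    def send(self, msg):
        self.log.append(msg)    # log BEFORE the message leaves
        self.sent.add(msg)

    def recover(self):
        # Checkpoint merged with the log rebuilds the pre-failure
        # send set, yielding a consistent global state.
        self.sent = set(self.checkpoint) | set(self.log)

# Usage: P checkpoints, then sends m2 and crashes; the checkpoint and
# log together still prove that m2 was sent.
p = LoggingProcess('P')
p.send('m1')
p.take_checkpoint()
p.send('m2')
p.sent = set()   # simulate a crash wiping volatile state
p.recover()
```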