does not receive a specified number of consecutive heartbeats, it may declare the application instance failed. The second method uses probes: the monitoring service periodically sends a probe, a lightweight service request, to the application instance. If the instance does not respond to a specified number of probes, it may be considered failed. There is a trade-off between the speed and accuracy of failure detection. To detect failures rapidly, it may be desirable to set a low threshold for the number of missed heartbeats or probes. However, this increases the number of false failures, since an application instance may fail to respond merely because of a momentary overload or some other transient condition. Because the consequences of falsely declaring an application instance failed are severe, a high threshold is generally set for the number of missed heartbeats or probes, to virtually eliminate the likelihood of a false declaration.
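As an illustration, the following is a minimal sketch in Python of such a threshold-based detector; the names and the interval_s and threshold parameters are hypothetical. Raising the threshold trades detection speed for fewer false failures, exactly the trade-off described above.

    import time

    class FailureDetector:
        # Declares an instance failed only after `threshold` consecutive
        # heartbeat intervals elapse without a heartbeat (or probe reply).
        def __init__(self, interval_s, threshold):
            self.interval_s = interval_s   # expected heartbeat/probe period
            self.threshold = threshold     # consecutive misses tolerated
            self.last_seen = {}            # instance id -> last heartbeat time

        def record_heartbeat(self, instance_id):
            self.last_seen[instance_id] = time.monotonic()

        def is_failed(self, instance_id):
            last = self.last_seen.get(instance_id)
            if last is None:
                return False               # never seen; a policy decision
            missed = (time.monotonic() - last) / self.interval_s
            return missed >= self.threshold

    # Example: with interval_s=1.0 and threshold=5, an instance is presumed
    # failed only after roughly five seconds of silence, which filters out
    # misses caused by momentary overload.
    detector = FailureDetector(interval_s=1.0, threshold=5)
    detector.record_heartbeat("app-1")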
13.6.3.1.2 Redirection
After identifying failed instances, it is necessary to stop routing new requests to them. A common mechanism for this in HTTP-based protocols is HTTP redirection, as illustrated below.
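The sketch below, using only the Python standard library, shows one way a front end might apply this once an instance is marked failed; the backend hostname is a hypothetical stand-in for a surviving instance. A 307 response preserves the request method, so the client retries the same request against the healthy instance.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    HEALTHY_BACKEND = "http://backend-2.example.internal"  # hypothetical survivor

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # 307 Temporary Redirect: the client repeats the same request
            # (method and body unchanged) against the Location target.
            self.send_response(307)
            self.send_header("Location", HEALTHY_BACKEND + self.path)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), RedirectHandler).serve_forever()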
13.6.3.2 Application Recovery
In addition to directing new requests to a server that is up, it is necessary to recover requests that were in progress on the failed instance. An application-independent method of doing this is checkpoint/restart: the cloud infrastructure periodically saves the state of the application. If the application is determined to have failed, the most recent checkpoint can be restored, and the application resumes from that state.
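At the application level, the essence of this scheme can be sketched as follows; the checkpoint path and state layout here are assumptions, and real infrastructure-level checkpointing captures full process state rather than a pickled dictionary.

    import os, pickle

    CHECKPOINT = "/var/checkpoints/app.ckpt"   # hypothetical location

    def save_checkpoint(state):
        # Write to a temporary file and rename atomically, so a crash
        # mid-write never corrupts the last good checkpoint.
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT)

    def restore_checkpoint():
        # On restart after a detected failure, resume from the most
        # recently written checkpoint.
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)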
13.6.3.2.1 Checkpoint/Restart Paradigm
Checkpoint/restart gives rise to a number of complexities. First, the infrastructure should checkpoint all resources, including system memory; otherwise, the memory of the restarted application may be inconsistent with the rest of its state. Checkpointing storage normally requires support from the storage or file system, since any updates performed after the checkpoint have to be rolled back. This can be complex in a distributed application, where updates by a failed instance may be intermingled with updates from running instances. It is also difficult to capture and reproduce in-flight network activity between distributed processes.
In distributed checkpoint/restart, all processes of a distributed application are checkpointed, and all instances are restarted from a common checkpoint if any instance fails. This has obvious scalability limitations and also suffers from correctness issues if any interprocess communication is in transit at the time of failure. For instance, Ubuntu Linux has support for checkpoint/restart of distributed programs. Even sequential applications can be transparently checkpointed if linked with