does not receive a specified number of consecutive heartbeats, it may declare the application instance failed. The second method uses probes: the monitoring service periodically sends a probe, a lightweight service request, to the application instance. If the instance does not respond to a specified number of probes, it may be considered failed. There is a trade-off between the speed and accuracy of failure detection. To detect failures rapidly, it may be desirable to set a low threshold for the number of missed heartbeats or probes. However, this increases the number of false failures, since an application instance may fail to respond merely because of a momentary overload or some other transient condition. Because the consequences of falsely declaring an application instance failed are severe, a high threshold is generally set for the number of missed heartbeats or probes, to virtually eliminate the likelihood of a false declaration.
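As an illustration, the following is a minimal sketch in Python of such a threshold-based detector; the names and the interval_s and threshold parameters are hypothetical. Raising the threshold trades detection speed for fewer false failures, exactly the trade-off described above.

    import time

    class FailureDetector:
        # Declares an instance failed only after `threshold` consecutive
        # heartbeat intervals elapse without a heartbeat (or probe reply).
        def __init__(self, interval_s, threshold):
            self.interval_s = interval_s   # expected heartbeat/probe period
            self.threshold = threshold     # consecutive misses tolerated
            self.last_seen = {}            # instance id -> last heartbeat time

        def record_heartbeat(self, instance_id):
            self.last_seen[instance_id] = time.monotonic()

        def is_failed(self, instance_id):
            last = self.last_seen.get(instance_id)
            if last is None:
                return False               # never seen; a policy decision
            missed = (time.monotonic() - last) / self.interval_s
            return missed >= self.threshold

    # Example: with interval_s=1.0 and threshold=5, an instance is presumed
    # failed only after roughly five seconds of silence, which filters out
    # misses caused by momentary overload.
    detector = FailureDetector(interval_s=1.0, threshold=5)
    detector.record_heartbeat("app-1")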
13.6.3.1.2 Redirection
After identifying failed instances, it is necessary to stop routing new requests to them. A common mechanism for this in HTTP-based protocols is HTTP redirection, as illustrated below.
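The sketch below, using only the Python standard library, shows one way a front end might apply this once an instance is marked failed; the backend hostname is a hypothetical stand-in for a surviving instance. A 307 response preserves the request method, so the client retries the same request against the healthy instance.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    HEALTHY_BACKEND = "http://backend-2.example.internal"  # hypothetical survivor

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # 307 Temporary Redirect: the client repeats the same request
            # (method and body unchanged) against the Location target.
            self.send_response(307)
            self.send_header("Location", HEALTHY_BACKEND + self.path)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), RedirectHandler).serve_forever()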
13.6.3.2 Application Recovery
In addition to directing new requests to a server that is up, it is necessary to recover requests that were in progress on the failed instance. An application-independent method of doing this is checkpoint/restart: the cloud infrastructure periodically saves the state of the application. If the application is determined to have failed, the most recent checkpoint can be restored, and the application resumes from that state.
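At the application level, the essence of this scheme can be sketched as follows; the checkpoint path and state layout here are assumptions, and real infrastructure-level checkpointing captures full process state rather than a pickled dictionary.

    import os, pickle

    CHECKPOINT = "/var/checkpoints/app.ckpt"   # hypothetical location

    def save_checkpoint(state):
        # Write to a temporary file and rename atomically, so a crash
        # mid-write never corrupts the last good checkpoint.
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT)

    def restore_checkpoint():
        # On restart after a detected failure, resume from the most
        # recently written checkpoint.
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)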
13.6.3.2.1 Checkpoint/Restart Paradigm
Checkpoint/restart gives rise to a number of complexities. First, the infrastructure should checkpoint all resources, including system memory; otherwise, the memory of the restarted application may be inconsistent with the rest of its state. Checkpointing storage normally requires support from the storage or file system, since any updates performed after the checkpoint have to be rolled back. This can be complex in a distributed application, where updates by a failed instance may be intermingled with updates from running instances. It is also difficult to capture and reproduce in-flight network activity between distributed processes.
In distributed checkpoint/restart, all processes of a distributed application are checkpointed, and all instances are restarted from a common checkpoint if any instance fails. This has obvious scalability limitations and also suffers from correctness issues if any interprocess communication is in transit at the time of failure. For instance, Ubuntu Linux has support for checkpoint/restart of distributed programs. Even sequential applications can be transparently checkpointed if linked with