Design Patterns for Resiliency - The Practice of Cloud System Administration


Less frequent restarts are often a sign of other problems. One restart every hour is not

cause for alarm but it should be investigated. Often these slower restart issues are detected

by the monitoring system rather than the process watcher.

Automated Crash Data Collection and Analysis

Every crash should be logged. Crashes usually leave behind a lot of information in a crash

report. The crash report includes statistics such as amount of RAM and CPU usage at the

time of the process's death, as well as detailed information such as a traceback of which

function call and line of code was executing when the problem occurred. A core dump (a

file containing the contents of the process's memory) is often written out during a crash.

Developers use this file to aid debugging.

Automated collection and storage of crash reports is useful because this information

may be lost if it is not collected quickly; the information may be deleted or the machine

may go away. Collecting the information is inconvenient for humans but easy for automation.

This is especially true in a system with hundreds of machines and hundreds of thousands

of processes. Storing the reports centrally permits data mining and analysis. A simple

analytical result, such as which systems crash the most, can be a useful engineering metric.

More intricate analysis can find bugs in common software libraries, the operating system,

hardware, or even particular chips.
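
As an illustration of the simple analysis described above, the following minimal sketch counts centrally stored crash reports per host and per traceback location. The CrashReport record and its field names are hypothetical, chosen only for this example; real crash reports carry far more detail.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CrashReport:
    # Hypothetical record shape, assumed for illustration only.
    host: str           # machine the process was running on
    process: str        # name of the process that died
    rss_bytes: int      # RAM in use at the time of death
    traceback_top: str  # function and line at the top of the traceback


def crashes_per_host(reports):
    """Return (host, crash_count) pairs, most crash-prone host first."""
    return Counter(r.host for r in reports).most_common()


def crashes_per_location(reports):
    """Return (code_location, crash_count) pairs, most frequent first.

    A location shared by many different processes hints at a bug in a
    common library rather than in any single service.
    """
    return Counter(r.traceback_top for r in reports).most_common()
```

Grouping by other fields (kernel version, hardware model) would follow the same pattern, assuming those fields are collected.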

6.5.2 Software Hangs

Sometimes when software has a problem it does not crash, but instead hangs or gets caught

in an infinite loop.

A strategy for detecting hangs is to monitor the server and detect if it has stopped

processing requests. We can passively observe request counts or actively test the system by

sending requests and verifying that a reply is generated within a certain amount of time.

These active requests, which are called pings, are designed to be lightweight, simply

verifying basic functionality.
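
The active test can be sketched as follows. This is a minimal example, not a definitive implementation: send_request is a hypothetical caller-supplied function that performs one lightweight request and returns the reply, and the one-second default timeout is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def ping(send_request, timeout_s=1.0):
    """Return True if send_request() yields a reply within timeout_s seconds.

    send_request is a hypothetical zero-argument callable supplied by the
    caller; it should issue one cheap request and return the reply.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(send_request)
    try:
        reply = future.result(timeout=timeout_s)
        return reply is not None
    except FutureTimeout:
        return False  # no reply in time: treat the server as hung
    except Exception:
        return False  # the request failed outright: treat the server as down
    finally:
        pool.shutdown(wait=False)  # do not block on a request that is stuck
```

Running the request in a worker thread keeps the checker itself from hanging when the server never answers; in practice a socket-level timeout on the request would serve the same purpose.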

If pings are sent at a specific, periodic rate and are used to detect hangs as well as

crashes, they are called heartbeat requests. When hangs are detected, an error can be generated,

an alert sent, or an attempt to restart the service can be made. If the server is one

of many replicas behind a load balancer, rather than simply restarting it, you can remove

it from the load balancing rotation and investigate the problem. Sometimes adding a new

replica is significantly more work than returning a replica that has been repaired to service.

For example, in the Google File System, a new replica added to the system requires

replicating possibly terabytes of files. This can flood the network. Fixing a hung replica and

returning it to service simply results in the existing data being revalidated, which is a much

more lightweight task.
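
The heartbeat approach described above might look like the following sketch, which takes a replica out of the load balancing rotation after several consecutive missed heartbeats and puts it back only when explicitly returned to service. The HeartbeatMonitor class, its method names, and the threshold of three missed heartbeats are all assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Track heartbeat results for replicas behind a load balancer.

    Hypothetical sketch: a real load balancer exposes its own interface
    for adding and removing backends.
    """

    def __init__(self, replicas, max_missed=3):
        self.in_rotation = set(replicas)
        self.missed = {r: 0 for r in replicas}
        self.max_missed = max_missed

    def record(self, replica, replied):
        """Record one heartbeat result; return True if replica is in rotation."""
        if replied:
            self.missed[replica] = 0
        else:
            self.missed[replica] += 1
            if self.missed[replica] >= self.max_missed:
                # Stop sending traffic, but leave the replica running
                # so the problem can be investigated.
                self.in_rotation.discard(replica)
        return replica in self.in_rotation

    def return_to_service(self, replica):
        """Put a repaired replica back into rotation."""
        self.missed[replica] = 0
        self.in_rotation.add(replica)
```

In this sketch a single successful heartbeat does not restore a removed replica automatically; the design choice is that someone investigates first and then calls return_to_service, which is cheaper than building a replacement replica from scratch.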
