Less frequent restarts are often a sign of other problems. One restart every hour is not
cause for alarm but it should be investigated. Often these slower restart issues are detected
by the monitoring system rather than the process watcher.
Automated Crash Data Collection and Analysis
Every crash should be logged. Crashes usually leave behind a lot of information in a crash
report. The crash report includes statistics such as amount of RAM and CPU usage at the
time of the process's death, as well as detailed information such as a traceback of which
function call and line of code was executing when the problem occurred. A coredump (a
file containing the contents of the process's memory) is often written out during a crash.
Developers use this file to aid debugging.
Automated collection and storage of crash reports is useful because this information
may be lost if it is not collected quickly; the information may be deleted or the machine
may go away. Collecting the information is inconvenient for humans but easy for automation.
This is especially true in a system with hundreds of machines and hundreds of thousands
of processes. Storing the reports centrally permits data mining and analysis. A simple
analytical result, such as which systems crash the most, can be a useful engineering metric.
More intricate analysis can find bugs in common software libraries, the operating system,
hardware, or even particular chips.
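
As a rough illustration of the "which systems crash the most" metric, the sketch below tallies centrally stored crash reports. The `CrashReport` fields, the traceback layout, and the sample data are all hypothetical, invented for this example; real crash reports carry far more detail.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CrashReport:
    # Hypothetical record layout, assumed only for this sketch.
    host: str          # machine the process was running on
    service: str       # name of the crashed service
    rss_bytes: int     # RAM in use at the time of the process's death
    traceback: str     # call stack, innermost frame on the last line


def crashes_by_service(reports):
    """Return (service, crash_count) pairs, most crashes first."""
    return Counter(r.service for r in reports).most_common()


def crashes_by_site(reports):
    """Return (innermost_frame, crash_count) pairs, most crashes first."""
    frames = (r.traceback.strip().splitlines()[-1].strip()
              for r in reports if r.traceback.strip())
    return Counter(frames).most_common()


# Made-up sample data standing in for a central crash-report store.
reports = [
    CrashReport("web01", "frontend", 512_000_000, "main\n  handle\n  libjson.parse:88"),
    CrashReport("web02", "frontend", 498_000_000, "main\n  render\n  libjson.parse:88"),
    CrashReport("db01", "storage", 2_100_000_000, "main\n  flush\n  libjson.parse:88"),
    CrashReport("web01", "frontend", 530_000_000, "main\n  handle\n  auth.check:17"),
]

service_ranking = crashes_by_service(reports)
site_ranking = crashes_by_site(reports)
```

Grouping by the innermost stack frame is one simple way to notice that crashes in different services share a common library; it is only a starting point for the more intricate analysis described above.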
6.5.2 Software Hangs
Sometimes when software has a problem it does not crash, but instead hangs or gets caught
in an infinite loop.
A strategy for detecting hangs is to monitor the server and detect if it has stopped processing
requests. We can passively observe request counts or actively test the system by
sending requests and verifying that a reply is generated within a certain amount of time.
These active requests, which are called pings, are designed to be lightweight, simply verifying
basic functionality.
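
The active test can be sketched as a function that sends one request and reports whether a reply arrived before a deadline. This is a minimal sketch, not a production health checker: `ping`, `send_request`, and `timeout_s` are names chosen for this example, and the two stand-in servers are simulated with plain functions rather than network calls.

```python
import queue
import threading
import time


def ping(send_request, timeout_s):
    """Return True if send_request() yields a reply within timeout_s seconds.

    send_request is any zero-argument callable that performs one lightweight
    request. A raised exception or a None reply counts as a failed ping.
    """
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(send_request())
        except Exception:
            result.put(None)

    # Daemon thread: a request that never returns cannot block program exit.
    threading.Thread(target=worker, daemon=True).start()
    try:
        reply = result.get(timeout=timeout_s)
    except queue.Empty:
        return False  # no reply before the deadline: treat the server as hung
    return reply is not None


def healthy_server():
    return "ok"


def hung_server():
    time.sleep(1.0)  # simulates a server that has stopped answering in time
    return "ok"


healthy_ok = ping(healthy_server, timeout_s=0.5)
hung_ok = ping(hung_server, timeout_s=0.1)
```

Running the request in a separate thread keeps the checker itself from hanging when the server does, which matters because the checker exists precisely to catch that case.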
If pings are sent at a specific, periodic rate and are used to detect hangs as well as
crashes, they are called heartbeat requests. When hangs are detected, an error can be generated,
an alert sent, or an attempt to restart the service can be made. If the server is one
of many replicas behind a load balancer, rather than simply restarting it, you can remove
it from the load balancing rotation and investigate the problem. Sometimes adding a new
replica is significantly more work than returning a replica that has been repaired to service.
For example, in the Google File System, a new replica added to the system requires replicating
possibly terabytes of files. This can flood the network. Fixing a hung replica and
returning it to service simply results in the existing data being revalidated, which is a much
more lightweight task.
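
The heartbeat-and-rotation idea can be sketched as a small monitor that probes each replica once per interval, pulls a replica from rotation after several consecutive missed heartbeats, and lets an operator return it once repaired. Everything here is an assumption made for illustration: the `HeartbeatMonitor` class, the `max_missed` threshold, and the replica names are hypothetical, and a real load balancer would expose its own interface for changing the rotation.

```python
class HeartbeatMonitor:
    """Tracks which replicas are in the load balancing rotation."""

    def __init__(self, replicas, probe, max_missed=3):
        # probe(replica) -> bool is a caller-supplied heartbeat check.
        # max_missed is an assumed threshold, not a value from the text.
        self.probe = probe
        self.max_missed = max_missed
        self.in_rotation = set(replicas)
        self.quarantined = set()
        self.missed = {replica: 0 for replica in replicas}

    def tick(self):
        """Send one heartbeat to every replica currently in rotation.

        The caller is expected to invoke tick() at a fixed, periodic rate.
        """
        for replica in sorted(self.in_rotation):
            if self.probe(replica):
                self.missed[replica] = 0
                continue
            self.missed[replica] += 1
            if self.missed[replica] >= self.max_missed:
                # Remove from rotation instead of restarting, so the
                # problem can be investigated with the replica's data intact.
                self.in_rotation.discard(replica)
                self.quarantined.add(replica)

    def restore(self, replica):
        """Return a repaired replica to service."""
        self.quarantined.discard(replica)
        self.missed[replica] = 0
        self.in_rotation.add(replica)


def fake_probe(replica):
    # Simulated heartbeat: replica-b is hung, the others respond.
    return replica != "replica-b"


monitor = HeartbeatMonitor(["replica-a", "replica-b", "replica-c"], fake_probe)
for _ in range(3):
    monitor.tick()  # in practice, one call per heartbeat interval
```

Requiring several consecutive misses before acting is a design choice of this sketch; it trades slower detection for fewer false alarms from a single slow reply.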