6.5 Software Failures
As long as there has been software, there have been software bugs. Long-running software
can die unexpectedly. Software can hang and not respond. For all these reasons, software
needs resilience features, too.
6.5.1 Software Crashes
A common failure in a system is that software crashes, or prematurely exits. There are
many reasons software may crash and many ways to respond. Server software is generally
intended to be long lived. For example, a server that provides a particular API is expected
to run forever unless the configuration changes in a way that requires a restart or the
service is decommissioned.
There are two categories of crashes:
A regular crash occurs when the software does something prohibited by the operating
system. For example, due to a software bug, the program may try to write to
memory that is marked read-only by the operating system. The OS detects this and
kills the process.
A panic occurs when the software itself detects something is wrong and decides
the best course is to exit. For example, the software may detect a situation that
shouldn't exist and cannot be corrected, and its author may have decided that
the safest thing to do in such a scenario is to exit. If internal data
structures are corrupted and there is no safe way to rectify them, it is best to stop
work immediately rather than continue with bad data. A panic is, essentially, an
intentional crash.
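The panic case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ledger data, the function names panic and check_ledger, and the exit status 70 are all hypothetical and do not come from any particular system.

```python
import sys

EXIT_PANIC = 70  # hypothetical exit status chosen for this sketch


def panic(message):
    """Report the problem and exit: an intentional crash."""
    print(f"PANIC: {message}", file=sys.stderr)
    sys.exit(EXIT_PANIC)


def check_ledger(entries, recorded_total):
    """Exit if the recorded total no longer matches the entries.

    A mismatch means the internal data is corrupted, and this sketch
    assumes there is no safe way to repair it.
    """
    if sum(entries) != recorded_total:
        panic("ledger total does not match entries; refusing to continue")
    return recorded_total
```

Calling check_ledger([1, 2, 3], 6) returns normally, while check_ledger([1, 2, 3], 7) writes a message to standard error and exits with a nonzero status. Note that sys.exit works by raising SystemExit, so surrounding code could still intercept it; a program that must stop without running any further handlers might prefer os._exit.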
Automated Restarts and Escalation
The easiest way to deal with a software crash is to restart the software. Sometimes the
problem is transient and a restart is all that is needed to fix it. Such restarts should be automated.
With thousands of servers, it is inefficient for a human to constantly be checking processes
to see if they are down and restarting them as needed. A program that handles this task is
called a process watcher.
However, restarting a down process is not as easy as it sounds. If it immediately crashes
again and again, we need to do something else; otherwise, we will be wasting CPU time
without improving the situation. Usually the process watcher will detect that the process
has been restarted x times in y minutes and consider that behavior cause to escalate the
issue. Escalation involves not restarting the process and instead reporting the problem to a
human. An example threshold might be that something has restarted more than five times
in a minute.
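The "x times in y minutes" rule can be sketched as a sliding window over restart timestamps. The following Python is a minimal sketch, not a description of any real process watcher: RestartTracker, watch, run_process, clock, and notify are hypothetical names, and the defaults simply mirror the example threshold of five restarts in a minute.

```python
import collections
import time


class RestartTracker:
    """Tracks recent restarts and decides when to escalate to a human."""

    def __init__(self, max_restarts=5, window_seconds=60.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = collections.deque()

    def record_restart(self, now):
        """Record that a restart is needed at time `now`.

        Returns True when more than `max_restarts` fall inside the
        window, meaning the watcher should escalate instead.
        """
        self._restarts.append(now)
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) > self.max_restarts


def watch(run_process, tracker, clock=time.monotonic, notify=print):
    """Restart the process each time it exits, until it crash-loops.

    `run_process` is a hypothetical callable that blocks until the
    supervised process has exited. Returns the restarts performed.
    """
    restarts = 0
    while True:
        run_process()  # returns when the process has died
        if tracker.record_restart(clock()):
            # Escalation: stop restarting and tell a human.
            notify("escalating: too many restarts; not restarting again")
            return restarts
        restarts += 1
```

Passing the clock and the notifier as parameters keeps the threshold logic easy to exercise with fixed timestamps. In a real deployment, notify would page or alert an operator rather than print.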