6.8 Human Error
As we design systems to be more resilient to hardware and software failures, the remaining
failures are likely to be due to human error. While this sounds obvious, this trend was not
recognized until the groundbreaking paper “Why Do Internet Services Fail, and What Can
Be Done about It?” was published in 2003 (Oppenheimer, Ganapathi & Patterson 2003).
The strategies for dealing with human error can be categorized as getting better humans,
removing humans from the loop, and detecting human errors and working around them.
We get better humans by having better operational practices, especially those that
exercise the skills and behaviors that most need improvement. (See Chapter 15.)
We remove humans from the loop through automation. Humans may get sloppy and not
do as much checking for errors during a procedure, but automation, once written, will
always check its work. (See Chapter 12.)
Detecting human errors and working around them is also a function of automation. A
pre-check is automation that checks inputs and prevents a process from running if the tests
fail. For example, a pre-check can verify that a recently edited configuration file has no
syntax errors and meets certain other quality criteria. Failing the pre-check would prevent
the configuration file from being put into use.
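A configuration pre-check of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the configuration is assumed to be JSON, and the required keys ("service", "port"), the port range rule, and the function names precheck_config and deploy_if_valid are hypothetical, not taken from any particular system.

```python
import json


def precheck_config(text, required_keys=("service", "port")):
    """Return a list of problems; an empty list means the pre-check passed."""
    # Syntax check: the file must parse before anything else is examined.
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"syntax error: {err.msg} at line {err.lineno}"]

    if not isinstance(config, dict):
        return ["top level must be an object"]

    # Quality criteria (hypothetical): required keys and a sane port value.
    problems = [
        f"missing required key: {key}"
        for key in required_keys
        if key not in config
    ]
    if "port" in config:
        port = config["port"]
        is_int = isinstance(port, int) and not isinstance(port, bool)
        if not (is_int and 1 <= port <= 65535):
            problems.append("port must be an integer between 1 and 65535")
    return problems


def deploy_if_valid(text, deploy):
    """Call deploy(text) only when the pre-check passes; return the problems."""
    problems = precheck_config(text)
    if not problems:
        deploy(text)
    return problems
```

The design point is that the deploy step is reachable only through the check, so a file that fails never goes into use. New rules learned from later outages can be appended to precheck_config without changing the callers.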
While pre-checks are intended to prevent problems, the reality is that they tend to lag
behind experience. That is, after each outage we add new pre-checks to prevent that same
human error from creating future outages.
Another common pre-check is for large changes. If a typical change usually consists of
only a few lines, a pre-check might require additional approval if the change is larger than
a particular number of lines. The change might be in the size of the input, the number of
changed lines between the current input and new input, or the number of changed lines
between the current and new output. For example, a configuration file may be used to
control a system that generates other files. The growth of the output by more than a certain
percentage may trigger additional approval.
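The large-change pre-check might look something like the following sketch. It uses Python's standard difflib module to count changed lines and compares the line count of the two versions to estimate growth. The thresholds (20 changed lines, 50 percent growth) and the function names are illustrative assumptions; a real system would tune them to what a typical change looks like. The same function can be applied to the input file or to the generated output.

```python
import difflib


def changed_line_count(current, new):
    """Count lines that differ between two versions of a text."""
    a, b = current.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced, inserted, or deleted run counts by its longer side.
            changed += max(i2 - i1, j2 - j1)
    return changed


def needs_extra_approval(current, new, max_changed_lines=20, max_growth_pct=50.0):
    """Return True when the change is large enough to need a second sign-off."""
    if changed_line_count(current, new) > max_changed_lines:
        return True
    old_size = len(current.splitlines())
    new_size = len(new.splitlines())
    if old_size > 0:
        growth_pct = 100.0 * (new_size - old_size) / old_size
        return growth_pct > max_growth_pct
    return False
```

Note that this check does not block the change outright. It only routes unusually large changes to an additional approval step, which matches the intent described above.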
Another way to be resilient to human error is to have two humans check all changes.
Many source code control systems can be configured to not accept changes from a user
until a second user approves them. All system administration that is done via changes to
files in a source code repository is then checked by a second pair of eyes. This is a very
common operational method at Google.
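The two-person rule itself is simple to state in code. The sketch below is a hypothetical illustration, not the configuration of any specific source code control system; the function name may_merge and the idea of passing approvers as a list of user names are assumptions.

```python
def may_merge(author, approvers):
    """Accept a change only if someone other than its author approved it."""
    return any(approver != author for approver in approvers)
```

For example, may_merge("alice", ["bob"]) accepts the change, while may_merge("alice", ["alice"]) and may_merge("alice", []) both reject it, because self-approval does not count as a second pair of eyes.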
6.9 Summary
Resiliency is a system's ability to constructively deal with failures. A resilient system
detects failure and routes around it.