Design Patterns for Resiliency - The Practice of Cloud System Administration

6.8 Human Error

As we design systems to be more resilient to hardware and software failures, the remaining

failures are likely to be due to human error. While this sounds obvious, this trend was not

recognized until the groundbreaking paper “Why Do Internet Services Fail, and What Can

Be Done about It?” was published in 2003 (Oppenheimer, Ganapathi & Patterson 2003).

The strategies for dealing with human error can be categorized as getting better humans,

removing humans from the loop, and detecting human errors and working around them.

We get better humans by having better operational practices, especially those that exercise

the skills and behaviors that most need improvement. (See Chapter 15.)

We remove humans from the loop through automation. Humans may get sloppy and not

do as much checking for errors during a procedure, but automation, once written, will always

check its work. (See Chapter 12.)

Detecting human errors and working around them is also a function of automation. A

pre-check is automation that checks inputs and prevents a process from running if the tests

fail. For example, a pre-check can verify that a recently edited configuration file has no

syntax errors and meets certain other quality criteria. Failing the pre-check would prevent

the configuration file from being put into use.
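
A configuration pre-check of this kind might be sketched as follows. This is only an illustration under stated assumptions: the configuration is assumed to be JSON, the "quality criteria" (required keys `service` and `port`, and a valid port range) are invented for the example, and the names `precheck_config` and `deploy_if_valid` are hypothetical.

```python
import json


def precheck_config(text, required_keys=("service", "port")):
    """Return a list of problems; an empty list means the pre-check passed.

    The required keys and the port rule are hypothetical quality criteria.
    """
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        # A syntax error makes every other check meaningless, so stop here.
        return [f"syntax error: {exc.msg} at line {exc.lineno}"]

    if not isinstance(config, dict):
        return ["top-level value must be an object"]

    problems = [
        f"missing required key: {key}"
        for key in required_keys
        if key not in config
    ]

    if "port" in config:
        port = config["port"]
        is_int = isinstance(port, int) and not isinstance(port, bool)
        if not (is_int and 1 <= port <= 65535):
            problems.append("port must be an integer between 1 and 65535")

    return problems


def deploy_if_valid(text, deploy):
    """Call deploy(text) only when the pre-check finds no problems."""
    problems = precheck_config(text)
    if not problems:
        deploy(text)
    return problems
```

The important design choice is that a failing pre-check blocks the deploy step entirely, rather than logging a warning and continuing.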

While pre-checks are intended to prevent problems, the reality is that they tend to lag

behind experience. That is, after each outage we add new pre-checks to prevent that same

human error from creating future outages.

Another common pre-check is for large changes. If a typical change usually consists of

only a few lines, a pre-check might require additional approval if the change is larger than

a particular number of lines. The change might be in the size of the input, the number of

changed lines between the current input and new input, or the number of changed lines

between the current and new output. For example, a configuration file may be used to control

a system that generates other files. The growth of the output by more than a certain

percentage may trigger additional approval.
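
A large-change pre-check could be sketched along these lines. The thresholds (20 changed lines, 25 percent output growth) and the function names are assumptions chosen for illustration; a real site would tune them to its own typical change size.

```python
import difflib


def changed_line_count(current, new):
    """Count lines that differ between two versions of a text.

    A replaced line counts once; inserted and deleted lines count once each.
    """
    a, b = current.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def output_growth_percent(current_output, new_output):
    """Percentage growth in line count from the current to the new output."""
    old = len(current_output.splitlines())
    new = len(new_output.splitlines())
    if old == 0:
        return float("inf") if new else 0.0
    return (new - old) / old * 100.0


def needs_extra_approval(current, new, current_output, new_output,
                         max_changed_lines=20, max_growth_percent=25.0):
    """Return True when the change is large enough to need a second sign-off.

    Both thresholds are hypothetical defaults, not recommended values.
    """
    if changed_line_count(current, new) > max_changed_lines:
        return True
    return output_growth_percent(current_output, new_output) > max_growth_percent
```

Checking the generated output as well as the input catches the case where a one-line edit to a template or generator rule expands into a very large change downstream.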

Another way to be resilient to human error is to have two humans check all changes.

Many source code control systems can be configured to not accept changes from a user

until a second user approves them. All system administration that is done via changes to

files in a source code repository is then checked by a second pair of eyes. This is a very

common operational method at Google.
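
The two-person rule itself is simple enough to state in a few lines. The sketch below is not the API of any particular source code control system; the `Change` class and `can_merge` function are hypothetical names used only to show the rule that the approver must be someone other than the author.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """A proposed change awaiting review (hypothetical data model)."""
    author: str
    description: str
    approvers: set = field(default_factory=set)


def can_merge(change):
    """Accept a change only if someone other than its author approved it."""
    return any(user != change.author for user in change.approvers)
```

Under this rule, an author approving their own change does not satisfy the check; only an approval from a second user does.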

6.9 Summary

Resiliency is a system's ability to constructively deal with failures. A resilient system detects

failure and routes around it.
