Chapter 15. Disaster Preparedness
Failure is not falling down but refusing to get back up.
—Theodore Roosevelt
Disasters and major outages happen. Everyone in the company from the top down needs to
recognize that fact and adopt a mindset that accepts outages and learns from them. An operations
organization needs to be able to handle outages well and avoid repeating past mistakes.
Previously we've examined technology related to being resilient to failures and outages
as well as organizational strategies like oncall. In this chapter we discuss disaster preparedness
at the individual, team, procedural, and organizational levels. People must be trained so
that they know the procedure well enough that they can execute it with confidence. Teams
need to practice together to build team cohesion and confidence, and to find and fix procedural
problems. Organizations need to practice to find inter-team gaps and to ensure the
organization as a whole is ready to handle the unexpected.
Every organization should have a strategy to ensure disaster preparedness at all these
levels. At the personnel level, training should be both formal (books, documentation, mentoring)
and through game play. Teams and organizations should use fire drills and game day
exercises to improve processes and find gaps in coverage. Something like the Incident Command
System (ICS) model, described later, should be used to coordinate recovery from outages.
Successful companies operating large distributed systems like Google, Facebook, Etsy,
Netflix, and others realize that the right way to handle outages is to be prepared for them,
adopt practices that reduce outages in the future, and reduce risk by practicing effective
response procedures. In “Built to Win: Deep Inside Obama's Campaign Tech” (Gallagher
2012), we learned that even presidential campaigns have found game day exercises critical
to success.
15.1 Mindset
The first step on the road to disaster preparedness is to acknowledge that disasters and major
outages happen. They are a normal and expected part of business. Therefore we prepare for
them and respond appropriately when they occur.
We want to reduce the number of outages, but eliminating them totally is unrealistic. No
technology is perfect. No computer system runs without fail. Eliminating the last 0.0001
percent of downtime is more expensive than mitigating the first 99.9999 percent. Therefore