Disaster Preparedness - The Practice of Cloud System Administration

as a tradeoff we tolerate a certain amount of downtime and balance it with preparedness so

that the situation is handled well.

Equally, no individual is perfect. Everyone makes mistakes. We strive to make as few as

possible, and never the same one twice. We try to hire people who are meticulous, but also

innovative. And we develop processes and procedures to try to catch the mistakes before

they cause outages, and to handle any outages that do occur as well as possible. As

discussed in Section 14.3.2, each outage should be treated as an opportunity to learn from our

own and others' mistakes and to improve the system. An outage exposes a weakness and

enables us to identify places to make the system more resilient, to add preventive checks,

and to educate the entire team so that they do not make the same mistake. In this way we

build an organization and a service that is antifragile.

While it is common practice at some companies, it is counterproductive to look for

someone to blame and fire when a major incident occurs. When people fear being fired,

they will adopt behaviors that are antithetical to good operations. They will hide their

mistakes, reducing transparency. When the real root causes are obscured, no one learns from

the mistake and additional checks are not put in place, meaning that it is more likely to

recur. This is why a part of the DevOps culture is to accept and learn from failure, which

exposes problems and thus enables them to be fixed.

Unfortunately, we often see that when a large company or government web site has a

highly visible outage, its management scrambles to fire someone to demonstrate to all that

the matter was taken seriously. Sometimes the media will inflame the situation, demanding

that someone be blamed and fired and questioning why it hasn't happened yet. The media

may eventually lay the blame on the CEO or president for not firing someone. The best

approach is to release a public version of the postmortem report, as discussed in Section

14.3.2, not naming individuals, but rather focusing on the lessons learned and the

additional checks that have been put in place to prevent it from happening again.

15.1.1 Antifragile Systems

We want our distributed computing systems to be antifragile. Antifragile systems become

stronger the more they are stressed or exposed to random behavior. Resilient systems

survive stress and failure, but only antifragile systems actually become stronger in response to

adversity.

Antifragile is not the opposite of fragile. Fragile objects break, or change, when exposed

to stress. Therefore the opposite of fragile is the ability to stay unchanged in the face of

stress. A tea cup is fragile and breaks if not treated gently. The opposite would be a tea

cup that stays the same (does not break) when dropped. Antifragile objects, by

comparison, react to stress by getting stronger. For example, the process of making steel involves
