as a tradeoff we tolerate a certain amount of downtime and balance it with preparedness so
that the situation is handled well.
Equally, no individual is perfect. Everyone makes mistakes. We strive to make as few as
possible, and never the same one twice. We try to hire people who are meticulous, but also
innovative. And we develop processes and procedures to try to catch the mistakes before
they cause outages, and to handle any outages that do occur as well as possible. As discussed
in Section 14.3.2, each outage should be treated as an opportunity to learn from our
own and others' mistakes and to improve the system. An outage exposes a weakness and
enables us to identify places to make the system more resilient, to add preventive checks,
and to educate the entire team so that they do not make the same mistake. In this way we
build an organization and a service that is antifragile.
While it is common practice at some companies, it is counterproductive to look for
someone to blame and fire when a major incident occurs. When people fear being fired,
they will adopt behaviors that are antithetical to good operations. They will hide their mistakes,
reducing transparency. When the real root causes are obscured, no one learns from
the mistake and additional checks are not put in place, meaning that it is more likely to
recur. This is why a part of the DevOps culture is to accept and learn from failure, which
exposes problems and thus enables them to be fixed.
Unfortunately, we often see that when a large company or government web site has a
highly visible outage, its management scrambles to fire someone to demonstrate to all that
the matter was taken seriously. Sometimes the media will inflame the situation, demanding
that someone be blamed and fired and questioning why it hasn't happened yet. The media
may eventually lay the blame on the CEO or president for not firing someone. The best
approach is to release a public version of the postmortem report, as discussed in Section
14.3.2, not naming individuals, but rather focusing on the lessons learned and the additional
checks that have been put in place to prevent it from happening again.
15.1.1 Antifragile Systems
We want our distributed computing systems to be antifragile. Antifragile systems become
stronger the more they are stressed or exposed to random behavior. Resilient systems survive
stress and failure, but only antifragile systems actually become stronger in response to
adversity.
Antifragile is not the opposite of fragile. Fragile objects break, or change, when exposed
to stress. Therefore the opposite of fragile is the ability to stay unchanged in the face of
stress. A tea cup is fragile and breaks if not treated gently. The opposite would be a tea
cup that stays the same (does not break) when dropped. Antifragile objects, by comparison,
react to stress by getting stronger. For example, the process of making steel involves