In our previous example, a problem was solved by rebooting a machine. Causal analysis
might indicate that the software has a memory leak. Working with the developers to find
and fix the memory leak is a long-term solution.
Even if you are not the developer who will ultimately fix the code, there is plenty of
work that can be done besides hounding the developers who will provide the final solution.
You can set up monitoring to collect information about the problem, so that before and after
comparisons can be made. You can work with the developers to understand how the issue
is affecting business objectives such as availability.
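As a minimal sketch of such a before and after comparison, assume the monitoring system records (seconds, megabytes) samples of the process's memory use. The function name growth_rate and the sample values below are hypothetical and chosen only for illustration. The sketch fits a least-squares line to each set of samples; a slope well above zero is consistent with a leak, and a slope near zero after the fix suggests the leak is gone.

```python
from statistics import mean


def growth_rate(samples):
    """Return the least-squares slope of memory use, in MB per second.

    samples is a list of (seconds, megabytes) pairs. Both the function
    name and the data format are assumptions made for this sketch.
    """
    times = [t for t, _ in samples]
    values = [v for _, v in samples]
    t_bar, v_bar = mean(times), mean(values)
    denom = sum((t - t_bar) ** 2 for t in times)
    if denom == 0:
        raise ValueError("need samples taken at two or more distinct times")
    return sum((t - t_bar) * (v - v_bar) for t, v in samples) / denom


# Hypothetical samples: one per minute, before and after a fix is deployed.
before = [(0, 100.0), (60, 112.0), (120, 124.0), (180, 136.0)]
after = [(0, 100.0), (60, 100.5), (120, 99.5), (180, 100.0)]

print(f"before fix: {growth_rate(before):+.4f} MB/s")
print(f"after fix:  {growth_rate(after):+.4f} MB/s")
```

Comparing slopes rather than single readings keeps the comparison meaningful even when the two measurement windows start at different memory levels.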
14.3.2 Postmortems
A postmortem is a process that analyzes an outage and documents what happened and why,
and makes recommendations about how to prevent that outage in the future.
A good postmortem process communicates up and down the management chain. It
communicates to users that action is being taken. It communicates to peer teams so that
interactions (good and bad) are learned from. It can also communicate to unrelated teams so
they can learn from your problems.
The postmortem process should not start until after the outage is complete. It should not
be a distraction from fixing the outage.
A postmortem is part of the strategy of continuous improvement. Each user-visible
outage or SLA violation should be followed by a postmortem and conclude with
implementation of the recommendations in the postmortem report. By doing so we turn outages into
learning, and learning into action.
Postmortem Purpose
A postmortem is not a finger-pointing exercise. The goal is to identify what went wrong so
the process can be improved in the future, not to determine who is to blame. Nobody should
be in fear of getting fired for having their name associated with a technical error. Blame
discourages the kind of openness required to have the transparency that enables problems
to be identified so that improvements can be made. If a postmortem exercise is a “name and
shame” process, then engineers become silent on details about actions and observations in
the future. “Cover your ass” (CYA) behavior becomes the norm. Less information flows,
so management becomes less informed about how work is performed and other engineers
become less knowledgeable about pitfalls within the system. As a result, more outages
happen, and the cycle begins again.
The postmortem process records, for any engineers whose actions have contributed to
the outage, a detailed account of actions they took at the time, effects they observed,
expectations they had, assumptions they had made, and their understanding of the timeline