of events as they occurred. They should be able to do this without fear of punishment or
retribution.
This is not to say that staff members get off the hook for making mistakes. They are on
the hook for many things. They are now the experts responsible for educating the
organization on how not to make that mistake in the future. They should drive engineering efforts
related to improving the situation.
A culture of accountability, rather than blame, fosters an organization that values
innovation. If blame is used to avoid responsibility, the whole team suffers. For more information
about this topic, we recommend Allspaw's (2009) article “Blameless Postmortems and a
Just Culture.”
A Postmortem Report for Every High-Priority Alert
At Google, many teams had a policy of writing a postmortem report every time
their monitoring system paged the oncall person. This was done to make sure that
no issues were ignored or “swept under the rug.” As a result, there was no
backsliding in Google's high standards for uptime. It also resulted in the alerting
system being highly tuned so that very few false alarms were generated.
Postmortem Report
Postmortem reports include four main components: a description of the outage, a timeline
of events, a contributing conditions analysis (CCA), and recommendations to prevent the
outage in the future. The outage description should say who was affected (for example,
internal customers or external customers) as well as which services were disrupted. The
timeline of events may be reconstructed after the fact, but it should clearly identify
what actually happened and when. The CCA should go into detail as to
why the outage occurred and include any significant context that may have contributed to
the outage (e.g., peak service hours, significant loads). Finally, the recommendations for
prevention in the future should include a filed ticket or bug ID for each recommendation in
the list.
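The four components described above can be captured as a simple record that is checked before a report is published. The following is only a minimal sketch, not the template from Appendix D: the class names, field names, and the problems() check are all hypothetical, chosen here for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    when: datetime  # when the event happened (may be reconstructed later)
    what: str       # what actually happened


@dataclass
class Recommendation:
    description: str
    ticket_id: str = ""  # filed ticket or bug ID; empty means none filed yet


@dataclass
class PostmortemReport:
    # All names here are hypothetical; adapt them to your own template.
    outage_description: str
    affected_parties: List[str]    # e.g., internal or external customers
    disrupted_services: List[str]
    timeline: List[TimelineEntry] = field(default_factory=list)
    contributing_conditions: List[str] = field(default_factory=list)  # the CCA
    recommendations: List[Recommendation] = field(default_factory=list)

    def sorted_timeline(self) -> List[TimelineEntry]:
        """Return the timeline in chronological order."""
        return sorted(self.timeline, key=lambda entry: entry.when)

    def problems(self) -> List[str]:
        """List the ways this report falls short of the four components."""
        issues = []
        if not self.affected_parties:
            issues.append("outage description does not say who was affected")
        if not self.disrupted_services:
            issues.append("outage description does not list disrupted services")
        if not self.timeline:
            issues.append("timeline of events is empty")
        if not self.contributing_conditions:
            issues.append("contributing conditions analysis is empty")
        if not self.recommendations:
            issues.append("no recommendations to prevent the outage in the future")
        for rec in self.recommendations:
            if not rec.ticket_id:
                issues.append(
                    "recommendation has no ticket or bug ID: " + rec.description
                )
        return issues
```

A report whose problems() list is empty has all four components filled in, and every recommendation is tied to a filed ticket or bug ID. A check like this could run before a report is circulated, so that recommendations without tickets are caught early.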
You will find a sample postmortem template in Section D.3 of Appendix D. If your
organization does not have a postmortem template, you can use this as the basis for yours.
The executive summary should include the most basic information of when the incident
happened and what the root causes were. It should reiterate any recommendations that will
need budget approval so that executives can connect the budget request to the incident in
their minds.