Causal analysis or contributing conditions analysis finds the conditions that brought
about the outage. It is sometimes called root cause analysis but that implies that outages
have only one cause. If tradition or office politics requires using the word “root,” at least
call it a root causes analysis to emphasize that there are many possible causes.
While emotionally satisfying to be able to point to a single cause, the reality is that there
are many factors leading up to an outage. The belief that an outage could have a single
cause implies that operations is a series of dominos that topple one by one, leading up to
an outage. Reality is much more complex. As Allspaw's (2012a) article “Each Necessary,
But Only Jointly Sufficient” points out, finding the root cause of a failure is like finding the
root cause of a success.
Postmortem Communication
Once the postmortem report is complete, copies should be sent to the appropriate teams,
including the teams involved in fixing the outage and the people affected by the outage.
If the users were external to the company, a version with proprietary information removed
should be produced. Be careful to abide by your company's policy about external
communications. The external version may be streamlined considerably. Publishing postmortems
externally builds customer confidence, and it is a best practice.
When communicating externally, the postmortem report should be accompanied by an
introduction that is less technical and highlights the important details. Most external
customers will not be able to understand a technical postmortem.
Include specific details such as start and end times, who or what was impacted, what
went wrong, and what lessons were learned. Demonstrate that you are using the experience
to improve in the future. If possible, include human elements such as heroic efforts,
unfortunate coincidences, and effective teamwork. You may also include what others can
learn from this experience.
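
The list of details above can be treated as a checklist before an external summary goes out. The following is only a minimal sketch in Python, not a format the text prescribes: the class name ExternalSummary, its field names, and the sample values are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ExternalSummary:
    """Hypothetical container for the details an external summary should state."""
    start: datetime
    end: datetime
    impacted: str
    what_went_wrong: str
    lessons_learned: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of the details that are still empty."""
        missing = []
        if not self.impacted.strip():
            missing.append("impacted")
        if not self.what_went_wrong.strip():
            missing.append("what_went_wrong")
        if not self.lessons_learned:
            missing.append("lessons_learned")
        return missing

    def duration_minutes(self) -> int:
        """Length of the outage in whole minutes."""
        return int((self.end - self.start).total_seconds() // 60)


# Made-up draft: times and impact are filled in, the rest is not yet written.
summary = ExternalSummary(
    start=datetime(2024, 1, 1, 9, 0),
    end=datetime(2024, 1, 1, 10, 30),
    impacted="Customers could not log in.",
    what_went_wrong="",
)
print(summary.duration_minutes(), summary.missing_fields())
```

A check like this only confirms that each detail is present. Whether the message sounds authentic and human still has to be judged by a person.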
It is important that such communication be authentic, admit failure, and sound like a
human, not a press agent. Figure 14.1 is an example of good external communication. Notice
that it is written in the first person, and contains real remorse—no hiding here. Avoid the
temptation to hide by using the third person or to minimize the full impact of the outage by
saying something like “We regret the impact it may have had on our users and customers.”
Don't regret that there may have been impact. There was impact—otherwise you wouldn't
be sending this message. More advice can be found in the blog post “A Guideline for
Postmortem Communication” on the Transparent Uptime blog (Rachitsky 2010).