sent, there is usually a second method, such as email, to communicate the complete
message.
The message should communicate the following information:
Failure Condition: A description of what is wrong in technical terms but in plain
English. For example, “QPS too high on service XYZ” is clear. “Error 42” is not.
Business Impact: The size and scope of the issue (for example, how many machines
or users this affects, and whether service is reduced or completely unavailable).
Escalation Chain: The escalation chain is who to contact, and who to contact if
that person does not respond. Generally, one or two chains are defined for each
service or group of services.
Suggested Resolution: Concise instructions on what to do to resolve this issue.
This is best done with a link to the playbook entry related to this alert, as described
in Section 14.2.5.
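As a rough illustration, the four items above could be carried in a small record that renders both a complete message (for email) and a shortened one (for a length-limited channel such as a text message). This is only a sketch: the class name, field names, 160-character default, and the playbook URL are assumptions made for the example, not part of any particular alerting product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertMessage:
    """The four pieces of information an alert message should carry."""
    failure_condition: str        # what is wrong, technical but in plain English
    business_impact: str          # size and scope; may start as a placeholder
    escalation_chain: List[str]   # who to contact, in order
    suggested_resolution: str     # ideally a link to the playbook entry

    def full_text(self) -> str:
        """Render the complete message, for example for email."""
        chain = " -> ".join(self.escalation_chain)
        return "\n".join([
            f"Failure Condition: {self.failure_condition}",
            f"Business Impact: {self.business_impact}",
            f"Escalation Chain: {chain}",
            f"Suggested Resolution: {self.suggested_resolution}",
        ])

    def short_text(self, limit: int = 160) -> str:
        """Render a shortened form for channels with a length limit."""
        text = f"{self.failure_condition} | {self.business_impact}"
        if len(text) <= limit:
            return text
        return text[: limit - 3] + "..."


# Hypothetical alert; the names and the URL are placeholders.
alert = AlertMessage(
    failure_condition="QPS too high on service XYZ",
    business_impact="Service XYZ affected (placeholder until specifics are known)",
    escalation_chain=["primary-oncall", "secondary-oncall"],
    suggested_resolution="https://playbook.example.com/xyz-qps-too-high",
)
print(alert.full_text())
```

Keeping the business impact as a free-text field makes it easy to start with a placeholder and refine it later.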
The last two items may be difficult to write at the time the alert rule is created. The specific
business impact may not be known, but at least you'll know which service is affected, so
that information can be used as a placeholder. When the alert rule is triggered, the specifics
will become clear. Take time to record your thoughts so as to not lose this critical
information.
Update the impact and resolution as part of the postmortem exercise. Bring all the
stakeholders together. Ask the affected stakeholders to explain how their business was impacted
in their own business terms. Ask the operational stakeholders to evaluate the steps taken,
including what went well and what could have been improved. Compare this information
with what is in the playbook and update it as necessary.
17.4.1 Alerting, Escalation, and Acknowledgments
The alert system is responsible for delivering the alert to the right person and escalating
to others if they do not respond. As described in Section 14.1.5, this information is encoded
in the oncall calendar.
In most cases, the workflow involves communicating to the primary oncall person or
people. They acknowledge the alert by replying to the text message with the word “ACK”
or “YES,” clicking on a link, or other means. If there is no acknowledgment after a certain
amount of time, the next person on the escalation list is tried.
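The acknowledgment workflow can be sketched as a loop over the escalation list. The sketch below is a simplified model under stated assumptions: `notify` is a hypothetical callback that sends the alert to one person and returns that person's reply text, or `None` if no reply arrives within the timeout. The accepted reply words and the 300-second default are illustrative. A negative acknowledgment ("NAK") is treated as an immediate hand-off to the next person, because the callback returns as soon as the reply arrives instead of waiting out the timeout.

```python
from typing import Callable, List, Optional

ACK_WORDS = {"ACK", "YES"}   # replies that acknowledge the alert
NAK_WORDS = {"NAK"}          # replies that decline it


def escalate(
    chain: List[str],
    notify: Callable[[str, float], Optional[str]],
    ack_timeout_s: float = 300.0,
) -> Optional[str]:
    """Try each person in order; return whoever acknowledges, else None."""
    for person in chain:
        reply = notify(person, ack_timeout_s)
        if reply is None:
            continue  # no acknowledgment in time: try the next person
        word = reply.strip().upper()
        if word in ACK_WORDS:
            return person  # this person now owns the alert
        if word in NAK_WORDS:
            continue  # declined: move on without waiting for the timeout
        # Any other reply is treated like no acknowledgment.
    return None


# Scripted replies standing in for a real messaging system (hypothetical names).
replies = {"primary": None, "secondary": "nak", "manager": "ack"}
contacted = []


def fake_notify(person: str, timeout_s: float) -> Optional[str]:
    contacted.append(person)
    return replies[person]


owner = escalate(["primary", "secondary", "manager"], fake_notify)
print(owner, contacted)
```

A return value of `None` means the whole list was exhausted without an acknowledgment, a case a real system would need to handle explicitly.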
Having the ability to negatively acknowledge (“NAK”) the alert saves time during
escalations. A NAK immediately escalates to the next person on the list. For example, if the
oncall person receives the alert but is unable to attend to the issue because his or her Inter-