Suppose the alert relates to indicators that your web site is slow and often timing out.
First we Observe: checking logs, reading I/O measurements, and so on.
Next we Orient ourselves to the situation. Orienting is the act of analyzing and
interpreting the data. For example, logs contain many fields, but to turn that data into
information the logs need to be queried to find anomalies or patterns. In this process we
form a hypothesis about the real cause, based on the data and our experience.
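As a minimal sketch of the Orient step, the following Python turns raw log lines into a count of slow or timed-out requests per URL path. The log format, the sample lines, and the 2-second threshold are all assumptions made up for illustration; a real site's logs will differ.

```python
from collections import Counter

# Hypothetical access-log lines: timestamp, method, path, status, seconds.
LOG_LINES = [
    "2024-01-01T10:00:01 GET /index.html 200 0.120",
    "2024-01-01T10:00:02 GET /search 200 4.800",
    "2024-01-01T10:00:03 GET /search 504 30.000",
    "2024-01-01T10:00:04 GET /about.html 200 0.095",
    "2024-01-01T10:00:05 GET /search 504 30.000",
]

SLOW_SECONDS = 2.0  # assumed threshold for calling a request "slow"


def parse(line):
    """Split one log line into named fields."""
    timestamp, method, path, status, seconds = line.split()
    return {
        "timestamp": timestamp,
        "method": method,
        "path": path,
        "status": int(status),
        "seconds": float(seconds),
    }


def slow_requests_by_path(lines, threshold=SLOW_SECONDS):
    """Count requests per path that were slow or ended in a 504 timeout."""
    counts = Counter()
    for line in lines:
        record = parse(line)
        if record["seconds"] >= threshold or record["status"] == 504:
            counts[record["path"]] += 1
    return counts


suspects = slow_requests_by_path(LOG_LINES)
print(suspects.most_common())
```

With this sample data every slow request lands on one path, which suggests a hypothesis worth testing: whatever backs that path is the bottleneck.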
Now we Decide to do something. Sometimes we decide that more information is needed
and begin to collect it. For example, if there are indications that the database is slow, then
we collect more specific diagnostics from the database and restart the loop.
The last stage is to Act and make changes that will either fix the problem, test a
hypothesis, or give us more data to analyze. If you decide that certain queries are making the
database server slow, eventually someone has to take action to fix them.
The OODA loop will almost always have many iterations. More experienced system
administrators can iterate through the loop logically, rapidly, and smoothly. Also, over time a
good team develops tools to make the loop go faster and gets better at working together to
tighten the loop.
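The iteration described above can be sketched in Python. This is only an illustration under stated assumptions: the canned measurements, the hypothesis strings, and the function names are hypothetical stand-ins for real diagnostics and human judgment.

```python
# Hypothetical diagnostic data, keyed by the area under investigation.
MEASUREMENTS = {
    "web": {"web_latency_s": 6.0, "db_latency_s": 5.5},
    "database": {"slow_query": "SELECT * FROM orders"},
}


def observe(target):
    """Observe: collect raw data for the area under investigation."""
    return MEASUREMENTS[target]


def orient(data):
    """Orient: interpret the data and form a hypothesis."""
    if "slow_query" in data:
        return "query " + repr(data["slow_query"]) + " is slow"
    if data["db_latency_s"] > 1.0:
        return "database is slow"
    return "no hypothesis"


def decide(hypothesis):
    """Decide: gather more information, fix something, or escalate."""
    if hypothesis == "database is slow":
        return ("observe", "database")
    if hypothesis.startswith("query"):
        return ("fix", hypothesis)
    return ("escalate", hypothesis)


def ooda(target="web", max_iterations=5):
    """Iterate until the decision is something other than 'observe more'."""
    history = []
    for _ in range(max_iterations):
        data = observe(target)
        hypothesis = orient(data)
        action, detail = decide(hypothesis)
        history.append((target, hypothesis, action))
        if action != "observe":
            break        # Act: hand the fix or escalation to a person or tool
        target = detail  # Act: collect more specific diagnostics, then loop
    return history


for step in ooda():
    print(step)
```

The first pass ends with a decision to look closer at the database; the second pass narrows the hypothesis to a specific query and ends with a decision to fix it.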
14.2.5 Oncall Playbook
Ideally, every alert that the system can generate will be matched by documentation that
describes what to do in response. An oncall playbook is this documentation.
The general format is a checklist of things to check or do. If the end of the list is
reached, the issue is escalated to the oncall escalation point (which itself may be a rotation
of people). This creates a self-correcting feedback loop. If people feel that there are too
many escalations waking them up late at night, they can correct the problem by improving
the documentation to make oncall more self-sufficient.
If they feel that writing documentation is unimportant or “someone else's job,” they can,
by virtue of not creating proper checklists, give oncall permission to wake them up at all
hours of the night. It is impressive how someone who feels that writing documentation is
below them suddenly learns the joy of writing after being woken up in the middle of the
night. The result of this feedback loop is that each checklist becomes as detailed as needed
to achieve the right balance.
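The checklist-then-escalate format can be modeled with a small data structure. The sketch below is one possible representation, not a standard one: the `Step` and `Playbook` classes, the `attempt` callback, and the `database-oncall` escalation name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    # Hypothetical callback: returns True if this step resolved the alert.
    attempt: Callable[[], bool]


@dataclass
class Playbook:
    alert: str
    steps: List[Step]
    escalation_point: str  # a person or a rotation of people


def respond(playbook):
    """Work through the checklist in order; escalate if the list runs out."""
    for number, step in enumerate(playbook.steps, start=1):
        if step.attempt():
            return f"resolved at step {number}: {step.description}"
    return f"escalate to {playbook.escalation_point}"


playbook = Playbook(
    alert="web site slow",
    steps=[
        Step("Check web server load", lambda: False),
        Step("Check the status of the database", lambda: False),
    ],
    escalation_point="database-oncall",
)
print(respond(playbook))
```

Because neither step resolves the alert in this example, the call falls through to escalation. Adding a step that does resolve it is the code-level equivalent of improving the documentation so that oncall no longer needs to wake someone up.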
When writing an oncall playbook, it can be a challenge to determine how detailed each
checklist should be. A statement like “Check the status of the database” might be sufficient
for an experienced person. Even so, the actual steps required to do that, and instructions on
what is considered normal, should be included. It is usually too much detail to explain very
basic information like how to log in.