In general, encourage a bias toward long-term fixes over quick fixes: “a stitch in time
saves nine.” However, oncall is different from normal engineering. Oncall places a higher
priority on speed than on long-term perfection. Since solutions that do not fit within the
SLA must be eliminated, a quick fix may be the only option.
Asking for Help
It is also the responsibility of the oncall person to ask for help when needed. Escalate to
more experienced or knowledgeable people or, if the issue was raised long enough ago,
find someone who is better rested than you are. You don't have to save the world single-
handedly. You are allowed to call other folks for help. Reach out to others especially if the
outage is large or if there are multiple alerts at the same time. You don't necessarily have
to fix the problem yourself. Rather, it is your responsibility to make sure it gets fixed,
which sometimes is best done by looping in the right people and coordinating rather than
trying to handle everything yourself.
Follow-up Work
Once the problem has been resolved, the priority shifts to raising the visibility of the issue
so that long-term fixes and optimizations will be done. For simple issues, it may be suf-
ficient to file a bug report or add annotations to an existing one. More complex issues re-
quire writing a postmortem report that captures what happened and makes recommenda-
tions about how it can be prevented in the future. By doing this we build a feedback loop
that assures operations get better over time, not worse. If the issue is not given visibility, the
core problem will not be fixed. Do not assume that “everyone knows it is broken” means
that it will get fixed. Not everyone does know it is broken. You can't expect that managers
who prioritize which projects are given resources will know everything or be able to read
your mind. Filing bug reports is like picking up litter: you can assume someone else will
do it, but if everyone did that nothing would ever be clean.
Once the cause is known, the alert should be categorized so that metrics can be gener-
ated. This helps spot trends and the resulting information should be used to determine fu-
ture project priorities. It is also useful to record which machines were involved in a search-
able way. Future alerts can then be related to past ones, and simple trends such as the same
machine failing repeatedly can be spotted.
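As a minimal sketch of the categorization and machine tracking described above, the following Python records each resolved alert with a category and the machines involved, then counts both. The record fields, category labels, hostnames, and the threshold of two are all hypothetical and chosen only for illustration. A real site would more likely keep this data in its ticket or monitoring system.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AlertRecord:
    """One resolved alert (hypothetical schema, for illustration only)."""
    day: date
    category: str     # assigned once the cause is known, e.g. "disk-full"
    machines: tuple   # hostnames involved, stored so they can be searched


def category_counts(records):
    """Count alerts per category, the raw material for trend metrics."""
    return Counter(r.category for r in records)


def repeat_offenders(records, threshold=2):
    """Return machines that appear in at least `threshold` alerts."""
    counts = Counter(m for r in records for m in set(r.machines))
    return {m: n for m, n in counts.items() if n >= threshold}


# Made-up sample data.
records = [
    AlertRecord(date(2024, 1, 3), "disk-full", ("db01",)),
    AlertRecord(date(2024, 1, 9), "disk-full", ("db01", "db02")),
    AlertRecord(date(2024, 1, 15), "network", ("web07",)),
]

print(category_counts(records))
print(repeat_offenders(records))
```

With the sample data, db01 is the only machine involved in more than one alert, which is the kind of simple trend worth raising when project priorities are set.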
Other follow-up tasks are discussed in Section 14.3.
14.2.4 Observe, Orient, Decide, Act (OODA)
The OODA loop was developed for combat operations by John Boyd. Designed for
situations like fighter jet combat, it fits high-stress situations that require quick
responses. Kyle Brandt (2014) popularized the idea of applying OODA to system administration.