Chapter 14. Oncall
Be alert... the world needs more lerts.
—Woody Allen
Oncall is the way we handle exceptional situations. Even though we try to automate all
operational tasks, there will always be responsibilities and edge cases that cannot be
automated away. These exceptional situations can happen at any time of the day; they do
not schedule themselves nicely between the hours of 9 AM and 5 PM.
Exceptional situations are, in brief, outages and anything that, if left unattended, would
lead to an outage. More specifically, they are situations where the service is, or will become,
in violation of the SLA.
An operations team needs a strategy to ensure that exceptional situations are attended to
promptly and receive appropriate action. The strategy should also be designed to reduce the
recurrence of such exceptions.
The best strategy is to establish a schedule whereby at any given time at least one person
is responsible for attending to such issues as his or her top priority. For the duration of the
oncall shift, that person should remain contactable and within reach of computers and other
facilities required to do his or her job. Between exceptions, the oncall person should be fo-
cused on follow-up work related to the exceptions faced during his or her shift.
In this chapter we will discuss this basic strategy plus many variations.
14.1 Designing Oncall
Oncall is the practice of having a group of people take turns being responsible for excep-
tional situations, more commonly known as emergencies or, less dauntingly, alerts. Oncall
schedules typically provide 24 × 7 coverage. By taking turns, people get a break from such
heightened responsibilities, can lead normal lives, and take vacations.
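The turn-taking described above can be sketched as a simple round-robin lookup. This is only an illustration under stated assumptions: the roster names, the one-week shift length, and the function name oncall_at are all hypothetical, and a real schedule would also need to handle swaps, vacations, and time zones.

```python
from datetime import datetime, timedelta


def oncall_at(roster, rotation_start, shift_length, when):
    """Return the roster member on call at `when` in a round-robin rotation.

    Assumes equal-length shifts that follow each other without gaps,
    which is what gives 24 x 7 coverage in this simplified model.
    """
    if not roster:
        raise ValueError("roster must not be empty")
    if when < rotation_start:
        raise ValueError("`when` precedes the start of the rotation")
    # Number of complete shifts that have ended before `when`.
    shifts_elapsed = (when - rotation_start) // shift_length
    return roster[shifts_elapsed % len(roster)]


# Hypothetical roster and rotation parameters, for illustration only.
roster = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, 9, 0)
week = timedelta(weeks=1)

print(oncall_at(roster, start, week, datetime(2024, 1, 10, 3, 0)))
```

With these values the lookup falls in the second shift, so the second person on the roster is on call. After the last person's shift the rotation wraps back to the first.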
When an alert is received, the person on call responds and resolves the issue, using
whatever means necessary to prevent SLA violations, including shortcut solutions that will
not solve the problem in the long term. If he or she cannot resolve the issue, there is an
escalation system whereby other people become involved. After the issue is managed, any
follow-up work should be done during normal business hours, in particular root cause
analysis, postmortems, and work on long-term solutions.
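One way an escalation system might decide who becomes involved is sketched below. The text says only that other people are brought in when the oncall person cannot resolve the issue. The policy used here (page the next person each time a fixed acknowledgment timeout passes) is an assumption, as are the chain entries, the 15-minute timeout, and the function name escalation_targets.

```python
from datetime import timedelta


def escalation_targets(chain, ack_timeout, elapsed):
    """Return everyone who should have been paged `elapsed` after an alert.

    Assumed policy: the first person in `chain` is paged immediately, and
    each time `ack_timeout` passes without the alert being acknowledged,
    the next person in the chain is paged as well.
    """
    if not chain:
        raise ValueError("escalation chain must not be empty")
    if elapsed < timedelta(0):
        raise ValueError("elapsed time must not be negative")
    # Cap at the last level so the chain never runs past its end.
    levels_reached = min(elapsed // ack_timeout, len(chain) - 1)
    return chain[: levels_reached + 1]


# Hypothetical escalation chain and timeout, for illustration only.
chain = ["primary oncall", "secondary oncall", "team lead"]
timeout = timedelta(minutes=15)

print(escalation_targets(chain, timeout, timedelta(minutes=20)))
```

Under this policy an alert left unacknowledged for 20 minutes has reached the first two people in the chain. Paging cumulatively, rather than handing off, keeps the original oncall person involved while others join.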
Normally one person is designated the “oncall person” at any given time. If there is an
alert from the monitoring system, that individual receives the alert and manages the issue