until it is resolved. During business hours this person works as normal, except that he or
she always works on projects that can be interrupted easily. After normal business hours,
the oncall person should be near enough to a computer so he or she can respond quickly.
There also needs to be a strategy to handle the situation when the oncall person cannot
be reached. This can happen due to commuting, network outages, health emergencies, or
other issues. Generally a secondary oncall person is designated to respond if the primary
person does not respond after a certain amount of time.
14.1.1 Start with the SLA
When designing an oncall scheme for an organization, begin with the SLA for the service.
Work backward to create an SLA for oncall that will result in meeting the SLA for the service.
Then design the oncall scheme that will meet the oncall SLA.
For example, suppose a service has an SLA that permits 2 hours of downtime before
penalties accrue. Suppose also that typical problems can be solved in 30 minutes, and extreme
problems require a system failover that takes 30 minutes, usually attempted only after 30 minutes
of trying other solutions. In the worst case, then, a fix takes 60 minutes. This would mean that
the time between when an outage starts and
when the issue is being actively worked on must be less than an hour.
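The arithmetic behind that one-hour figure can be sketched in a few lines of Python. This is only an illustration using the example's numbers; the function and variable names are hypothetical and do not come from any particular tool.

```python
def oncall_response_budget(sla_downtime_min, worst_case_fix_min):
    """Return the minutes available between outage start and active work."""
    budget = sla_downtime_min - worst_case_fix_min
    if budget <= 0:
        raise ValueError("the SLA leaves no time to respond; revisit the fix time or the SLA")
    return budget


# Figures taken from the example above.
typical_fix_min = 30
extreme_fix_min = 30 + 30  # 30 minutes trying other solutions, then a 30-minute failover
worst_case_fix_min = max(typical_fix_min, extreme_fix_min)

budget_min = oncall_response_budget(sla_downtime_min=120,
                                    worst_case_fix_min=worst_case_fix_min)
print(budget_min)  # 60
```

Budgeting against the worst-case fix time rather than the typical one is what keeps the oncall SLA safe for the extreme problems as well.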
In that hour, the following things must happen. First, the monitoring system must detect
the outage. If it polls every 5 minutes and alerts only after three attempts, a maximum of 15
minutes may pass before someone is alerted. This assumes the worst case of the last good
poll happening right before the outage. Let's assume that alerts are sent every 5 minutes
until someone responds; every third alert results in escalation from primary to secondary or
from secondary to the entire team. The worst case (assuming the team isn't alerted) is six
alerts, or 30 minutes. From receiving the alert, the oncall person may need 5-10 minutes to
log into the system and begin working. So far we have accumulated about 50-55 minutes
of outage before “hands on keyboard” has been achieved. Considering we estimated a maximum
of 60 minutes to fix a problem, this leaves us with 5 minutes to spare.
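The worst-case timeline in this example can be tallied with a short Python sketch. The class and its field names are hypothetical. The sketch follows the text's own counting (each alert costs one 5-minute interval) and assumes that alerts 1-3 go to the primary, 4-6 to the secondary, and later ones to the entire team, which is one reading of "every third alert results in escalation."

```python
from dataclasses import dataclass


@dataclass
class AlertingPlan:
    poll_interval_min: int = 5          # monitoring polls every 5 minutes
    failed_polls_before_alert: int = 3  # alert only after three failed attempts
    alert_interval_min: int = 5         # alerts repeat every 5 minutes
    alerts_per_tier: int = 3            # every third alert escalates
    login_min: tuple = (5, 10)          # time to log in and begin working

    def detection_delay(self):
        """Worst case: the last good poll happened right before the outage."""
        return self.poll_interval_min * self.failed_polls_before_alert

    def alerting_delay(self, tiers_tried=2):
        """Minutes spent alerting when the last tier tried responds on its final alert."""
        return self.alert_interval_min * self.alerts_per_tier * tiers_tried

    def hands_on_keyboard(self, tiers_tried=2):
        """Return (best, worst) minutes from outage start to active work."""
        base = self.detection_delay() + self.alerting_delay(tiers_tried)
        fastest, slowest = self.login_min
        return base + fastest, base + slowest

    def recipient(self, alert_number):
        """Who receives the Nth alert (1-based) under the assumed escalation order."""
        tiers = ["primary", "secondary", "entire team"]
        tier = min((alert_number - 1) // self.alerts_per_tier, len(tiers) - 1)
        return tiers[tier]


plan = AlertingPlan()
best, worst = plan.hands_on_keyboard()
print(best, worst)        # 50 55
print(60 - worst)         # 5 minutes to spare against the one-hour budget
print(plan.recipient(4))  # secondary
```

Changing a single field, such as the polling interval, shows immediately whether a different monitoring configuration would still fit within the budget.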
Every service is different, so you must do these calculations for each one. If you are
managing many services, it can be worthwhile to simplify the process by creating a few
classes of service based on the required response time: 5 minutes, 15 minutes, 30 minutes,
and longer. Monitoring, alerting, and compensation schemes for each class can be defined
and reused for all new services rather than reinventing the wheel each time.
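As a rough illustration, assigning a service to one of these classes could look like the following Python sketch. The names are hypothetical, and two points are assumptions rather than anything the text specifies: a service is placed in the slowest class that still meets its required response time, and anything needing more than 30 minutes falls into the catch-all "longer" class.

```python
import bisect

# Response-time classes from the text, in minutes; "longer" is the catch-all.
CLASS_RESPONSE_MIN = [5, 15, 30]


def service_class(required_response_min):
    """Pick the slowest predefined class that still meets the requirement."""
    if required_response_min < CLASS_RESPONSE_MIN[0]:
        raise ValueError("no predefined class is fast enough; design a custom scheme")
    if required_response_min > CLASS_RESPONSE_MIN[-1]:
        return "longer"
    # Index of the largest class response time <= the requirement.
    idx = bisect.bisect_right(CLASS_RESPONSE_MIN, required_response_min) - 1
    return f"{CLASS_RESPONSE_MIN[idx]}-minute"


print(service_class(20))  # 15-minute
print(service_class(30))  # 30-minute
print(service_class(60))  # longer
```

Rounding toward the faster class means the shared monitoring and alerting scheme never promises less than a service needs.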
14.1.2 Oncall Roster
The roster is the list of people who take turns being oncall. The list is made up of qualified
operations staff, developers, and managers. All operations staff should be on the roster.
This is generally considered part of any operations staff member's responsibility.