time the schedule was locked. This system was unfair to people who happened to be out
the day the schedule was made.
Some companies take a more algorithmic approach. Google had hundreds of individual
calendars to create for any given month due to the existence of many internal and external
services. Each team spent a lot of time negotiating and assembling calendars until someone
wrote a program that did the task for them. To use the system, a team would create a Google
Calendar and everyone would insert events to mark which days they were entirely unavailable,
available but not preferred, available, or preferred. The system took a configuration file
that described parameters such as how long each shift was, whether there was a required
gap of time before someone could have another rotation, and so on. The system then read
people's preferences from the Google Calendar and churned on the data until a reasonable
oncall calendar was created.
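The approach described above can be illustrated with a small sketch. This is not the program Google used, whose internals are not described here; it is a minimal greedy scheduler under stated assumptions. The four preference levels come from the text, but the numeric weights, the Config field name, the build_schedule function, and the tie-breaking rule (fewest shifts so far, then name) are all hypothetical. Each shift is treated as a single slot filled by one person.

```python
from dataclasses import dataclass

# The four levels are from the text; the numeric weights are assumptions.
UNAVAILABLE, NOT_PREFERRED, AVAILABLE, PREFERRED = 0, 1, 2, 3


@dataclass
class Config:
    # Hypothetical parameter: number of shifts that must pass before
    # the same person can be oncall again.
    min_gap_shifts: int = 2


def build_schedule(prefs, num_shifts, config):
    """Greedily assign one person per shift.

    prefs maps person -> {shift_index: level}; a missing entry is
    treated as AVAILABLE. Returns a list with one name per shift, or
    None where nobody is eligible.
    """
    schedule = []
    last_shift = {}                    # person -> most recent shift index
    counts = {p: 0 for p in prefs}     # person -> shifts assigned so far
    for shift in range(num_shifts):
        candidates = []
        for person, by_shift in prefs.items():
            level = by_shift.get(shift, AVAILABLE)
            if level == UNAVAILABLE:
                continue
            prev = last_shift.get(person)
            if prev is not None and shift - prev <= config.min_gap_shifts:
                continue               # required gap not yet satisfied
            # Highest preference first, then fewest shifts, then name.
            candidates.append((-level, counts[person], person))
        if not candidates:
            schedule.append(None)      # leave the slot for a human to resolve
            continue
        _, _, chosen = min(candidates)
        schedule.append(chosen)
        last_shift[chosen] = shift
        counts[chosen] += 1
    return schedule


prefs = {
    "ana": {0: PREFERRED, 1: UNAVAILABLE},
    "bo": {1: PREFERRED},
    "cy": {},
}
print(build_schedule(prefs, 4, Config(min_gap_shifts=1)))
```

A greedy pass like this is simple to reason about but can paint itself into a corner. A system that "churns on the data" may instead search over many candidate calendars and keep the best one.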
14.1.6 Oncall Frequency
How often a person goes oncall needs careful consideration. Each alert
has a certain amount of follow-up work that should be completed before the next turn at
oncall begins. Each person should also have sufficient time between oncall shifts to work
on projects, not just follow-up work.
The follow-up work from an alert can be extensive. Writing a postmortem can be an arduous
task. Root cause analysis can involve extensive research that lasts days or weeks.
The longer the oncall shift, the more alerts will be received and the more follow-up projects
the person will be trying to do at the same time. This can overload a person.
The more closely the shifts are spaced, the more likely the work will not be completed
by the time the next shift starts.
Doing one or two postmortems simultaneously is reasonable, but doing many more is impossible.
Therefore a shift should last only as long as it takes for one or two significant alerts to
accumulate. Depending on the service, this may be one day, a week of 8-hour periods, or a
week of 24 × 7 service. A person's next shift should be at least three weeks later
if the person is expected to complete both postmortems, do project work, and be able to go
on an occasional vacation. If a service receives so many alerts that this is not possible, then
the service has deeper issues.
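The three-week spacing implies a minimum rotation size, which a little arithmetic makes concrete. The sketch below assumes a strict round-robin with one person oncall at a time, equal-length shifts, and a gap measured from the end of one shift to the start of that person's next one. None of these assumptions is stated in the text, and the function name is hypothetical. With N people and shifts of S days, each person rests (N - 1) × S days, so N must be at least ceil(G / S) + 1 for a gap of G days.

```python
def min_rotation_size(shift_days: int, gap_days: int) -> int:
    """Smallest round-robin team size that gives each person at least
    gap_days off between the end of one shift and the start of the next.

    Assumes one person oncall at a time and equal-length shifts.
    """
    if shift_days <= 0 or gap_days < 0:
        raise ValueError("shift_days must be positive and gap_days non-negative")
    # Integer ceiling of gap_days / shift_days, plus the person currently oncall.
    return (gap_days + shift_days - 1) // shift_days + 1


# One-week shifts with a three-week gap need at least four people.
print(min_rotation_size(shift_days=7, gap_days=21))
```

Under the same assumptions, shorter shifts require proportionally more people to preserve the same gap.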
Oncall shifts can be stressful. If the source of stress is that the shift is too busy, consider
using shorter shifts or having a second person oncall to handle overflow. If the source of
stress is that people do not feel confident in their ability to handle the alerts, additional
training is recommended. Ways to train a team to be more comfortable dealing with outage
situations are discussed in Chapter 15.