Some teams have a list of tasks that are done during each shift. Some example tasks
include verifying the monitoring system is working, checking that backups ran, and
checking for security alerts related to software used in-house. These tasks should be
eliminated through automation. However, until they are automated, assigning responsibility
to the current oncall person is a convenient way to spread the work around the team. These
tasks are generally ones that can be done between alerts and are assumed to be done sometime
during the shift, though it is wise to do them early in the shift so as not to forget them.
However, if a shift starts when someone is normally asleep, expecting these tasks to be done
at the very start of the shift is unreasonable. Waking people up for non-emergencies is not
healthy.
Tasks may be performed daily, weekly, or monthly. In all cases there should be a way
to register that the task was completed. Either maintain a shared spreadsheet where people
mark things as complete, or automatically open a ticket to be closed when the task is done.
All tasks should have an accompanying bug ID that requests the task be eliminated through
automation or other means. For example, verifying that the monitoring system is running
can be automated by having a system that monitors the monitoring system. (See Section
16.5, “Meta-monitoring.”) A task such as emptying the water bucket that collects
condensation from a temporary cooling device should be eliminated when the temporary
cooling system is finally replaced.
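As an illustration of registering task completion, here is a minimal sketch of a task register in Python. It is not from the text: the names `ShiftTask`, `CADENCE_DAYS`, and `due_tasks`, the bug ID format, and the choice of 30 days for "monthly" are all assumptions made for the example. A shared spreadsheet or a ticket system would serve the same purpose.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical cadence table: days between required completions.
# Treating "monthly" as 30 days is a simplifying assumption.
CADENCE_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}


@dataclass
class ShiftTask:
    name: str
    cadence: str           # "daily", "weekly", or "monthly"
    elimination_bug: str   # ID of the bug requesting that the task be eliminated
    completed_on: list = field(default_factory=list)

    def mark_done(self, day: date) -> None:
        """Register that the task was completed on the given day."""
        self.completed_on.append(day)

    def is_due(self, today: date) -> bool:
        """A task is due if it was never done or its cadence has elapsed."""
        if not self.completed_on:
            return True
        last = max(self.completed_on)
        return today - last >= timedelta(days=CADENCE_DAYS[self.cadence])


def due_tasks(tasks, today):
    """Return the names of tasks the current oncall person still owes."""
    return [t.name for t in tasks if t.is_due(today)]
```

Requiring `elimination_bug` on every task mirrors the rule that each manual task carries a request for its own removal.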
Oncall should be relatively stress-free when there is no active alert.
14.2.3 Alert Responsibilities
Once alerted, your responsibilities change. You are now responsible for verifying the
problem, fixing it, and ensuring that follow-up work gets completed. You may not be the
person who does all of this work, but you are responsible for making sure it all happens
through delegation and handoffs.
You should acknowledge the alert within the SLA described previously. Acknowledging
the alert tells the alerting system that it should not try to alert the next contact on the
escalation list.
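The effect of an acknowledgment on escalation can be sketched as follows. This is a simplified model, not the behavior of any particular alerting product: the function name `contacts_paged`, the fixed acknowledgment window, and the rule that an acknowledgment arriving exactly at the deadline counts as late are all assumptions for illustration.

```python
def contacts_paged(escalation_list, ack_window_minutes, minutes_elapsed, acked_at=None):
    """Return the contacts who have been paged so far.

    The first contact is paged at minute 0. Each time ack_window_minutes
    pass without an acknowledgment, the next contact on the list is paged.
    An acknowledgment at minute acked_at stops any further escalation.
    """
    # Escalation only advances until the acknowledgment arrives, if it has.
    cutoff = minutes_elapsed if acked_at is None else min(acked_at, minutes_elapsed)
    # Number of full windows that expired with no acknowledgment.
    expired_windows = cutoff // ack_window_minutes
    count = min(expired_windows + 1, len(escalation_list))
    return escalation_list[:count]
```

For example, with a 5-minute window and the list `["primary", "secondary", "manager"]`, an acknowledgment at minute 3 means only `"primary"` is ever paged, while no acknowledgment after 12 minutes means all three have been paged.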
Quick Fixes versus Long-Term Fixes
Now you work on the issue. Your priority is to come up with the best solution that will
resolve the issue within the SLA. Sometimes we have a choice between a long-term fix
and a quick fix. The long-term fix will resolve the fundamental problem and prevent the
issue in the future. It may involve writing code or releasing new software. Rarely can that
be done within the SLA. A quick fix fits within the SLA but may simply push the issue
farther down the road. For example, rebooting a machine may fix the problem for now but
will require rebooting it again in a few days because the technical problem was not fixed.
However, the reboot can be done now and will prevent an SLA violation.
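The trade-off between the two kinds of fix can be expressed as a small decision rule. The sketch below is an illustration only: the `Fix` class, the `choose_fix` function, and the time estimates in the usage note are hypothetical. It prefers a long-term fix when one fits in the time remaining under the SLA, and otherwise falls back to a quick fix while flagging that follow-up work is still owed.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    name: str
    minutes_needed: int  # estimated time to apply the fix
    long_term: bool      # True if it resolves the fundamental problem


def choose_fix(options, sla_minutes_left):
    """Pick a fix that fits in the remaining SLA time.

    Returns (fix, needs_follow_up). fix is None when no option fits,
    in which case follow-up is certainly needed.
    """
    feasible = [f for f in options if f.minutes_needed <= sla_minutes_left]
    if not feasible:
        return None, True
    # Prefer long-term fixes; among fixes of the same kind, prefer the faster one.
    best = min(feasible, key=lambda f: (not f.long_term, f.minutes_needed))
    # A quick fix only postpones the issue, so the long-term work must be tracked.
    return best, not best.long_term
```

With options such as `Fix("reboot", 10, False)` and `Fix("patch and release", 480, True)` and 30 minutes left, the rule selects the reboot and reports that follow-up is required.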