Oncall - The Practice of Cloud System Administration

Some teams have a list of tasks that are done during each shift. Some example tasks include verifying the monitoring system is working, checking that backups ran, and checking for security alerts related to software used in-house. These tasks should be eliminated through automation. However, until they are automated, assigning responsibility to the current oncall person is a convenient way to spread the work around the team. These tasks are generally ones that can be done between alerts and are assumed to be done sometime during the shift, though it is wise to do them early in the shift so as not to forget them. However, if a shift starts when someone is normally asleep, expecting these tasks to be done at the very start of the shift is unreasonable. Waking people up for non-emergencies is not healthy.

Tasks may be performed daily, weekly, or monthly. In all cases there should be a way to register that the task was completed. Either maintain a shared spreadsheet where people mark things as complete, or automatically open a ticket to be closed when the task is done.
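
As a rough illustration of registering completion, the sketch below tracks which shift tasks are still due on a given day. The `ShiftTask` class, the `open_tasks` helper, and the 30-day approximation of "monthly" are assumptions made for this example; a real team would more likely use its own spreadsheet or ticket system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical cadence lengths in days; "monthly" is approximated as 30 days.
CADENCE_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}


@dataclass
class ShiftTask:
    name: str
    cadence: str                       # "daily", "weekly", or "monthly"
    last_done: Optional[date] = None   # None means never registered as done

    def is_due(self, today: date) -> bool:
        """A task is due if it was never done or its cadence has elapsed."""
        if self.last_done is None:
            return True
        return today - self.last_done >= timedelta(days=CADENCE_DAYS[self.cadence])

    def mark_done(self, today: date) -> None:
        """Register that the task was completed today."""
        self.last_done = today


def open_tasks(tasks: List[ShiftTask], today: date) -> List[str]:
    """Return the names of tasks the current oncall person still has to do."""
    return [t.name for t in tasks if t.is_due(today)]


tasks = [
    ShiftTask("check that backups ran", "daily"),
    ShiftTask("review security alerts", "weekly"),
]
today = date(2024, 1, 8)
tasks[0].mark_done(today)
print(open_tasks(tasks, today))
```

The same record could just as well be a ticket that is opened automatically and closed when the task is done; the point is only that completion is written down somewhere the team can see.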

All tasks should have an accompanying bug ID that requests the task be eliminated through automation or other means. For example, verifying that the monitoring system is running can be automated by having a system that monitors the monitoring system. (See Section 16.5, “Meta-monitoring.”) A task such as emptying the water bucket that collects condensation from a temporary cooling device should be eliminated when the temporary cooling system is finally replaced.
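
One simple way to monitor the monitoring system is a heartbeat staleness check run from an independent host. The sketch below assumes the monitoring system records a timestamp each time it completes a check cycle; the function name and the 300-second threshold are illustrative choices, not something the text prescribes.

```python
import time


def monitoring_is_alive(last_heartbeat: float, now: float,
                        max_age_seconds: float = 300.0) -> bool:
    """Return True if the monitoring system reported in recently enough.

    last_heartbeat and now are Unix timestamps in seconds. The default
    max_age_seconds of 300 is a hypothetical threshold.
    """
    return (now - last_heartbeat) <= max_age_seconds


# Example: a heartbeat written 10 minutes ago is considered stale.
now = time.time()
if not monitoring_is_alive(last_heartbeat=now - 600, now=now):
    print("monitoring system has not reported in; page the oncall person")
```

Once a check like this runs automatically, the manual "verify the monitoring system is working" task can be removed from the shift list.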

Oncall should be relatively stress-free when there is no active alert.

14.2.3 Alert Responsibilities

Once alerted, your responsibilities change. You are now responsible for verifying the problem, fixing it, and ensuring that follow-up work gets completed. You may not be the person who does all of this work, but you are responsible for making sure it all happens through delegation and handoffs.

You should acknowledge the alert within the SLA described previously. Acknowledging the alert tells the alerting system that it should not try to alert the next contact on the escalation list.
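
The interaction between acknowledgment and escalation can be sketched as follows. This is a minimal model, not the behavior of any particular alerting product: the `Alert` class, the `contact_to_page` function, and the idea that each contact gets one full acknowledgment window (`ack_sla_seconds`) before the next is tried are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    summary: str
    raised_at: float                  # Unix timestamp, seconds
    acked_by: Optional[str] = None    # None until someone acknowledges

    def acknowledge(self, person: str) -> None:
        """Record who took responsibility for the alert."""
        self.acked_by = person


def contact_to_page(alert: Alert, escalation_list: List[str],
                    now: float, ack_sla_seconds: float) -> Optional[str]:
    """Return who should be paged at time `now`, or None once acknowledged.

    Each contact is given one acknowledgment window; when it expires
    unacknowledged, paging moves to the next contact. The last contact
    keeps being paged if nobody acknowledges.
    """
    if alert.acked_by is not None:
        return None
    windows_elapsed = int((now - alert.raised_at) // ack_sla_seconds)
    index = min(windows_elapsed, len(escalation_list) - 1)
    return escalation_list[index]


alert = Alert("disk nearly full on db1", raised_at=0.0)
contacts = ["primary oncall", "secondary oncall", "manager"]
print(contact_to_page(alert, contacts, now=400.0, ack_sla_seconds=300.0))
alert.acknowledge("primary oncall")
print(contact_to_page(alert, contacts, now=400.0, ack_sla_seconds=300.0))
```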

Quick Fixes versus Long-Term Fixes

Now you work on the issue. Your priority is to come up with the best solution that will resolve the issue within the SLA. Sometimes we have a choice between a long-term fix and a quick fix. The long-term fix will resolve the fundamental problem and prevent the issue in the future. It may involve writing code or releasing new software. Rarely can that be done within the SLA. A quick fix fits within the SLA but may simply push the issue farther down the road. For example, rebooting a machine may fix the problem for now but will require rebooting it again in a few days because the technical problem was not fixed. However, the reboot can be done now and will prevent an SLA violation.
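
The trade-off can be written down as a small decision rule: take the long-term fix if it fits in the time remaining, otherwise take a quick fix and record that follow-up work is owed. The sketch below is only one way to express that reasoning; the `Fix` class, the time estimates, and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fix:
    name: str
    estimated_minutes: float
    long_term: bool   # True if it resolves the fundamental problem


def choose_fix(candidates: List[Fix], minutes_left_in_sla: float) -> Optional[Fix]:
    """Pick a fix that fits within the SLA, preferring long-term fixes.

    Returns None if no candidate can be completed in the time remaining.
    """
    feasible = [f for f in candidates if f.estimated_minutes <= minutes_left_in_sla]
    if not feasible:
        return None
    # Long-term fixes sort first; ties are broken by the shortest estimate.
    return min(feasible, key=lambda f: (not f.long_term, f.estimated_minutes))


def needs_follow_up(chosen: Optional[Fix]) -> bool:
    """A quick fix leaves the underlying problem open, so follow-up is owed."""
    return chosen is not None and not chosen.long_term


fixes = [
    Fix("reboot the machine", estimated_minutes=10, long_term=False),
    Fix("patch the bug and release", estimated_minutes=480, long_term=True),
]
chosen = choose_fix(fixes, minutes_left_in_sla=60)
print(chosen.name, "| follow-up needed:", needs_follow_up(chosen))
```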
