Operations in a Distributed World - The Practice of Cloud System Administration

opment experience to be able to communicate with developers on their level and have an

appreciation for what developers do, and for what computers can and can't do.

When SREs and developers come from a common staffing pool, that means that projects

are allocated a certain number of engineers; these engineers may be developers or SREs.

The end result is that each SRE needed means one fewer developer in the team. Contrast

this to the case at most companies where system administrators and developers are allocated

from teams with separate budgets. Rationally, a project wants to maximize the number

of developers, since they write new features. The common staffing pool encourages the

developers to create systems that can be operated efficiently so as to minimize the number of

SREs needed.

Another way to encourage developers to write code that minimizes operational load is

to require that excess operational work overflows to the developers. This practice discourages

developers from taking shortcuts that create undue operational load. The developers

would share any such burden. Likewise, by requiring developers to perform 5 percent of

operational work, developers stay in tune with operational realities.

Within the SRE team, capping the operational load at 50 percent limits the amount of

manual labor done. Manual labor has a lower return on investment than, for example, writing

code to replace the need for such labor. This is discussed in Section 12.4.2, “Reducing

Toil.”
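
The 50 percent cap and the overflow rule can be combined into a small piece of arithmetic. The sketch below is only an illustration: the function name, the hour-based accounting, and the idea that work above the cap is exactly what overflows to developers are assumptions made for the example, not details given in the text.

```python
def split_operational_work(ops_hours, sre_hours, cap=0.50):
    """Split operational work between the SRE team and the developers.

    ops_hours: operational work that arrived in the period (hypothetical unit: hours)
    sre_hours: total working hours the SRE team has in the period
    cap:       largest fraction of SRE time spent on operations (50 percent in the text)
    """
    if ops_hours < 0 or sre_hours < 0:
        raise ValueError("hours must be non-negative")
    sre_limit = cap * sre_hours                  # most ops work the SREs may absorb
    handled_by_sres = min(ops_hours, sre_limit)
    # Assumption: anything above the cap overflows to the developers.
    overflow_to_developers = ops_hours - handled_by_sres
    return handled_by_sres, overflow_to_developers


# Hypothetical week: 4 SREs at 40 hours each, 100 hours of operational work.
handled, overflow = split_operational_work(ops_hours=100, sre_hours=160)
print(handled, overflow)  # 80.0 20.0
```

In this made-up week, the SREs absorb 80 hours and the remaining 20 hours land on the developers, which is the incentive the overflow rule is meant to create.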

Many SRE practices relate to finding balance between the desire for change and the

need for stability. The most important of these is the Google SRE practice called Error

Budgets, explained in detail in Section 19.4.

Central to the Error Budget is the SLA. All services must have an SLA, which specifies

how reliable the system is going to be. The SLA becomes the standard by which all work

is ultimately measured. SLAs are discussed in Chapter 16.
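
To make the link between an SLA and an Error Budget concrete, here is a minimal sketch. It assumes one common reading, which this passage does not spell out: the budget is the amount of unreliability the SLA still permits, measured here as minutes of downtime in a 30-day period. The function names and the 99.9 percent target are hypothetical.

```python
def error_budget_minutes(sla_percent, period_days=30):
    """Minutes of downtime a given availability target permits in the period.

    Assumption: the budget is simply the share of time the SLA does not promise.
    """
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be in (0, 100]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def budget_remaining(sla_percent, downtime_minutes, period_days=30):
    """Budget left after subtracting downtime already incurred (negative if overspent)."""
    return error_budget_minutes(sla_percent, period_days) - downtime_minutes


# Hypothetical service with a 99.9 percent target over 30 days:
# the budget is about 43.2 minutes, so a 30-minute outage leaves about 13.2.
print(round(error_budget_minutes(99.9), 1))
print(round(budget_remaining(99.9, downtime_minutes=30), 1))
```

Expressing the budget in minutes is only one choice; a request-based budget (failed requests out of total requests) would follow the same arithmetic.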

Any outage or other major SLA-related event should be followed by the creation of a

written postmortem that includes details of what happened, along with analysis and suggestions

for how to prevent such a situation in the future. This report is shared within the

company so that the entire organization can learn from the experience. Postmortems focus

on the process and the technology, not on finding whom to blame. Postmortems are the topic of

Section 14.3.2. The person who is oncall is responsible for responding to any SLA-related

events and producing the postmortem report.

Oncall is not just a way to react to problems, but rather a way to reduce future problems.

It must be done in a way that is not unsustainably stressful for those oncall, and it drives

behaviors that encourage long-term fixes and problem prevention. Oncall teams are made

up of at least eight members at one location, or six members at two locations. Teams of this

size will be oncall often enough that their skills do not get stale, and their shifts can be short
