opment experience to be able to communicate with developers on their level and have an
appreciation for what developers do, and for what computers can and can't do.
When SREs and developers come from a common staffing pool, that means that projects
are allocated a certain number of engineers; these engineers may be developers or SREs.
The end result is that each SRE needed means one fewer developer in the team. Contrast
this to the case at most companies where system administrators and developers are alloc-
ated from teams with separate budgets. Rationally a project wants to maximize the number
of developers, since they write new features. The common staffing pool encourages the
developers to create systems that can be operated efficiently so as to minimize the number of
SREs needed.
Another way to encourage developers to write code that minimizes operational load is
to require that excess operational work overflows to the developers. This practice discour-
ages developers from taking shortcuts that create undue operational load. The developers
would share any such burden. Likewise, by requiring developers to perform 5 percent of
operational work, developers stay in tune with operational realities.
Within the SRE team, capping the operational load at 50 percent limits the amount of
manual labor done. Manual labor has a lower return on investment than, for example, writing
code to replace the need for such labor. This is discussed in Section 12.4.2, “Reducing
Toil.”
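
The overflow rule and the 50 percent cap can be pictured with a little arithmetic. The following is only a sketch under stated assumptions: the function name split_operational_load, the hour figures, and the idea of measuring load in hours are all hypothetical and are not taken from the text. It shows how operational work above the cap might be handed to developers.

```python
def split_operational_load(total_sre_hours, operational_hours, cap=0.50):
    """Split operational work between SREs and developers.

    total_sre_hours   -- hours the SRE team has available (assumed unit)
    operational_hours -- hours of operational work that arrived
    cap               -- share of SRE time that may go to operational work
    Returns (hours kept by SREs, hours overflowing to developers).
    """
    if total_sre_hours < 0 or operational_hours < 0:
        raise ValueError("hours must be non-negative")
    if not 0 <= cap <= 1:
        raise ValueError("cap must be between 0 and 1")
    sre_limit = total_sre_hours * cap          # most ops work SREs may absorb
    kept = min(operational_hours, sre_limit)   # SREs take work up to the cap
    overflow = operational_hours - kept        # the excess goes to developers
    return kept, overflow


# Hypothetical month: 400 SRE hours available, 260 hours of operational work.
kept, overflow = split_operational_load(400, 260)
print(kept, overflow)  # 200.0 60.0
```

In this made-up case the developers would pick up 60 hours, which is the kind of shared burden that discourages shortcuts.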
Many SRE practices relate to finding balance between the desire for change and the
need for stability. The most important of these is the Google SRE practice called Error
Budgets, explained in detail in Section 19.4.
Central to the Error Budget is the SLA. All services must have an SLA, which specifies
how reliable the system is going to be. The SLA becomes the standard by which all work
is ultimately measured. SLAs are discussed in Chapter 16.
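
To make the link between an SLA and a budget concrete, here is a minimal sketch. It assumes the SLA is expressed as an availability fraction and treats the budget as the downtime that target still permits over a window. That is a common formulation, not necessarily the exact definition used in Section 19.4. The 99.9 percent target, the 30-day window, and the function names are illustrative assumptions.

```python
def allowed_downtime_minutes(availability_target, window_days=30):
    """Minutes of downtime an availability target permits over a window."""
    if not 0 < availability_target <= 1:
        raise ValueError("availability_target must be in (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    return (1 - availability_target) * window_minutes


def remaining_budget_minutes(availability_target, downtime_so_far, window_days=30):
    """Budget left after downtime already incurred; negative means overspent."""
    return allowed_downtime_minutes(availability_target, window_days) - downtime_so_far


# Assumed example: a 99.9 percent target over 30 days, 30 minutes already lost.
budget = allowed_downtime_minutes(0.999)
left = remaining_budget_minutes(0.999, downtime_so_far=30)
print(round(budget, 1), round(left, 1))  # 43.2 13.2
```

Under these assumptions, a service with budget left can afford more change, while one that has overspent would favor stability.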
Any outage or other major SLA-related event should be followed by the creation of a
written postmortem that includes details of what happened, along with analysis and sug-
gestions for how to prevent such a situation in the future. This report is shared within the
company so that the entire organization can learn from the experience. Postmortems focus
on the process and the technology, not on finding whom to blame. Postmortems are the topic of
Section 14.3.2. The person who is oncall is responsible for responding to any SLA-related
events and producing the postmortem report.
Oncall is not just a way to react to problems, but rather a way to reduce future problems.
It must be done in a way that is not unsustainably stressful for those oncall, and it drives
behaviors that encourage long-term fixes and problem prevention. Oncall teams are made
up of at least eight members at one location, or six members at two locations. Teams of this
size will be oncall often enough that their skills do not get stale, and their shifts can be short