Operations in a Distributed World - The Practice of Cloud System Administration


work is allocated. The case study at the end of Section 2.2.2 is an example of this approach.

Similarly, this allocation can be achieved by assigning dedicated people to stability-related

code changes.

The budget can also be based on an SLA. A certain amount of instability is expected

each month, which is considered a budget. Each roll-out uses some of the budget, as do

instability-related bugs. Developers can maximize the number of roll-outs that can be done

each month by dedicating effort to improve the code that causes this instability. This creates

a positive feedback loop. An example of this is Google's Error Budgets, which are more

fully explained in Section 19.4.
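
The SLA-based budget described above can be sketched in a few lines of Python. This is a minimal illustration, not Google's implementation: the class name `ErrorBudget`, its method names, the 99.9 percent target, and the 30-day period are all assumptions chosen for the example. It shows the budget being derived from the SLA, roll-outs and bugs drawing from the same budget, and a launch being held back once too little budget remains.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical monthly instability budget derived from an availability SLA."""
    sla_target: float        # e.g. 0.999 for 99.9% availability (assumed value)
    period_minutes: float    # length of the budget period, in minutes
    spent_minutes: float = 0.0

    @property
    def total_minutes(self) -> float:
        # The budget is the downtime the SLA does not promise to avoid.
        return (1.0 - self.sla_target) * self.period_minutes

    @property
    def remaining_minutes(self) -> float:
        return self.total_minutes - self.spent_minutes

    def record_instability(self, minutes: float) -> None:
        # Roll-outs and instability-related bugs both draw from the same budget.
        self.spent_minutes += minutes

    def can_launch(self, expected_cost_minutes: float) -> bool:
        # Gate the launch: allow it only if its expected cost fits in what is left.
        return expected_cost_minutes <= self.remaining_minutes


# Assumed example: a 99.9% SLA over a 30-day month allows about 43.2 minutes.
budget = ErrorBudget(sla_target=0.999, period_minutes=30 * 24 * 60)
budget.record_instability(40.0)   # instability already consumed this month
```

With 40 of roughly 43.2 minutes spent, `budget.can_launch(5.0)` is false and `budget.can_launch(3.0)` is true. Reducing the instability caused by each roll-out is what lets more roll-outs fit in the same month.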

7.1.3 Defining SRE

The core practices of SRE were refined for more than 10 years at Google before being

enumerated in public. In his keynote address at the first USENIX SREcon, Benjamin Treynor

Sloss (2014), Vice President of Site Reliability Engineering at Google, listed them as

follows:

Site Reliability Practices

1. Hire only coders.

2. Have an SLA for your service.

3. Measure and report performance against the SLA.

4. Use Error Budgets and gate launches on them.

5. Have a common staffing pool for SRE and Developers.

6. Have excess Ops work overflow to the Dev team.

7. Cap SRE operational load at 50 percent.

8. Share 5 percent of Ops work with the Dev team.

9. Oncall teams should have at least eight people at one location, or six people at each of multiple locations.

10. Aim for a maximum of two events per oncall shift.

11. Do a postmortem for every event.

12. Postmortems are blameless and focus on process and technology, not people.
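
Several of the practices above are numeric thresholds (items 7, 9, and 10), so a team could check them mechanically. The following is only a sketch under stated assumptions: the function names, the input shapes, and the location names in the usage note are hypothetical, and the thresholds are copied from the list rather than taken from any published tool.

```python
from typing import Mapping


def oncall_staffing_ok(people_per_location: Mapping[str, int]) -> bool:
    """Practice 9: eight people at a single location, or six at each of several."""
    if not people_per_location:
        return False
    minimum = 8 if len(people_per_location) == 1 else 6
    return all(count >= minimum for count in people_per_location.values())


def ops_load_within_cap(ops_hours: float, total_hours: float,
                        cap: float = 0.50) -> bool:
    """Practice 7: operational work should not exceed 50 percent of SRE time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return ops_hours / total_hours <= cap


def shift_exceeds_event_target(events_in_shift: int, max_events: int = 2) -> bool:
    """Practice 10: flag an oncall shift with more than two events."""
    return events_in_shift > max_events
```

For example, `oncall_staffing_ok({"site_a": 6, "site_b": 6})` passes, while `oncall_staffing_ok({"site_a": 7})` does not, because a single location needs eight people.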

The first principle for site reliability engineering is that SREs must be able to code. An SRE

might not be a full-time software developer, but he or she should be able to solve nontrivial

problems by writing code. When asked to do 30 iterations of a task, an SRE should do

the first two, get bored, and automate the rest. An SRE must have enough software devel-
