work is allocated. The case study at the end of Section 2.2.2 is an example of this approach.
Similarly, this allocation can be achieved by assigning dedicated people to stability-related
code changes.
The budget can also be based on an SLA. A certain amount of instability is expected
each month, which is considered a budget. Each roll-out uses some of the budget, as do
instability-related bugs. Developers can maximize the number of roll-outs that can be done
each month by dedicating effort to improve the code that causes this instability. This creates
a positive feedback loop. An example of this is Google's Error Budgets, which are more
fully explained in Section 19.4.
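The budget arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's implementation: the class name ErrorBudget, its fields, and the choice to measure instability in minutes of downtime are all hypothetical. The SLA target and the length of the month are example values.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ErrorBudget:
    """Hypothetical model: instability is counted as minutes of downtime."""
    sla_target: float     # assumed example: 0.999 means 99.9 percent availability
    period_minutes: int   # length of the accounting period, e.g. one month

    def total_minutes(self) -> float:
        # The budget is whatever the SLA does not promise.
        return (1.0 - self.sla_target) * self.period_minutes

    def remaining_minutes(self, spent: Iterable[float]) -> float:
        # Roll-outs and instability-related bugs both draw on the same budget.
        return self.total_minutes() - sum(spent)

    def can_roll_out(self, spent: Iterable[float]) -> bool:
        # Gate further roll-outs once the budget is used up.
        return self.remaining_minutes(spent) > 0


# A 30-day month at 99.9 percent allows roughly 43 minutes of instability.
budget = ErrorBudget(sla_target=0.999, period_minutes=30 * 24 * 60)
spent_so_far = [10.0, 20.0]   # minutes lost to two earlier incidents (made-up data)
print(round(budget.remaining_minutes(spent_so_far), 1))
print(budget.can_roll_out(spent_so_far))
```

Under this model, code improvements that shrink the downtime caused by each roll-out leave more budget for further roll-outs in the same month, which is the feedback loop the text describes.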
7.1.3 Defining SRE
The core practices of SRE were refined for more than 10 years at Google before being
enumerated in public. In his keynote address at the first USENIX SREcon, Benjamin Treynor
Sloss (2014), Vice President of Site Reliability Engineering at Google, listed them as
follows:
Site Reliability Practices
1. Hire only coders.
2. Have an SLA for your service.
3. Measure and report performance against the SLA.
4. Use Error Budgets and gate launches on them.
5. Have a common staffing pool for SRE and Developers.
6. Have excess Ops work overflow to the Dev team.
7. Cap SRE operational load at 50 percent.
8. Share 5 percent of Ops work with the Dev team.
9. Oncall teams should have at least eight people at one location, or six people at
each of multiple locations.
10. Aim for a maximum of two events per oncall shift.
11. Do a postmortem for every event.
12. Postmortems are blameless and focus on process and technology, not people.
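Several of the practices above are numeric thresholds, so they can be checked mechanically. The following is a minimal sketch, assuming a team tracks its own staffing and workload figures. The function names and parameters are hypothetical and are not part of any published tool.

```python
from typing import Sequence


def oncall_staffing_ok(people_per_location: Sequence[int]) -> bool:
    """Practice 9: eight people at one location, or six at each of several."""
    if not people_per_location:
        return False
    if len(people_per_location) == 1:
        return people_per_location[0] >= 8
    return all(count >= 6 for count in people_per_location)


def ops_load_ok(ops_hours: float, total_hours: float) -> bool:
    """Practice 7: operational work is capped at 50 percent of SRE time."""
    return total_hours > 0 and ops_hours / total_hours <= 0.5


def shift_load_ok(events_in_shift: int) -> bool:
    """Practice 10: aim for at most two events per oncall shift."""
    return events_in_shift <= 2


# Example figures (made up for illustration).
print(oncall_staffing_ok([6, 7]))    # two sites, each with at least six people
print(ops_load_ok(18, 40))           # 18 of 40 hours spent on operational work
print(shift_load_ok(3))              # three events in one shift exceeds the goal
```

A check like this could run against weekly workload reports so that a breach of the 50 percent cap is noticed early enough for excess Ops work to overflow to the Dev team, as practice 6 suggests.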
The first principle for site reliability engineering is that SREs must be able to code. An SRE
might not be a full-time software developer, but he or she should be able to solve nontrivial
problems by writing code. When asked to do 30 iterations of a task, an SRE should do
the first two, get bored, and automate the rest. An SRE must have enough software devel-