it is unfair to put operations in this position because of the information asymmetry inherent
in this relationship. Developers know the code better than operations and therefore are in
a better position to perform testing and judge the quality of the release. Operations staff,
though they are unlikely to admit it, are not mind readers.
19.4.3 Everyone Benefits
For developers, the Error Budget creates incentives to improve reliability by offering them
something they value highly: the opportunity to do more releases. This encourages them to
test releases more thoroughly, to adopt better release practices, and to invest effort in
building frameworks that improve operations and reliability. Previously these tasks might have
been considered distractions from creating new features. Now these tasks create the ability
to push more features.
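
The incentive can be made concrete with a little arithmetic. What follows is a minimal sketch, not Google's implementation. It assumes the budget is simply the downtime permitted by an availability target over a measurement window, and that releases are allowed while some of that budget remains. The class name, the 99.9 percent target, and the 30-day window are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical model: budget = (1 - target) * window."""

    target_availability: float  # assumed target, e.g. 0.999 for 99.9 percent
    window_minutes: float       # assumed measurement window, in minutes

    @property
    def allowed_downtime(self) -> float:
        # Total minutes of downtime the target tolerates in the window.
        return (1.0 - self.target_availability) * self.window_minutes

    def remaining(self, downtime_so_far: float) -> float:
        # Budget left after subtracting the downtime already incurred.
        return self.allowed_downtime - downtime_so_far

    def release_allowed(self, downtime_so_far: float) -> bool:
        # Assumed policy: releases continue only while budget remains.
        return self.remaining(downtime_so_far) > 0


# A 30-day window at a 99.9 percent target allows approximately 43.2 minutes.
budget = ErrorBudget(target_availability=0.999, window_minutes=30 * 24 * 60)
```

Under these assumptions, a team that has caused 10 minutes of downtime may keep releasing, while one that has caused 50 minutes has exhausted the budget. Every outage avoided through better testing leaves more budget for further launches.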
For example, developers may create a framework that permits new code to be tested
better, or to perform 1 percent experiments with less effort. They are encouraged to take
advantage of existing frameworks they may not have considered before. For example,
implementation of lame-duck mode, as described in Section 2.1.3, may be built into the web
framework they use, but they have simply not taken advantage of it.
More importantly, the budget creates peer pressure between developer teams to have
high standards. Development for a given service is usually the result of many subteams.
Each team wants to launch frequently. Yet one team can blow the budget for all teams if
they are not careful. Nobody wants to be the last team to adopt a technology or framework
that improves launch success. Also, there is less information asymmetry between developer
teams. Therefore teams can set high standards for code reviews and other such processes.
(Code reviews are discussed in Section 12.7.6.)
This does not mean that Google considers it okay to be down for an hour each year. If
you recall from Section 1.3, user-visible services are often composed of the output of many
other services. If one of those services is not responding, the composition can still succeed
by replacing the missing part with generic filler, by showing blank space, or by using other
graceful degradation techniques as described in Section 2.1.10.
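
The composition described above can be sketched as follows. This is an illustrative sketch only: the function name, the idea of passing each backend as a callable, and the filler string are assumptions made for the example, not an API from the text.

```python
from typing import Callable, Dict


def compose_page(sections: Dict[str, Callable[[], str]], filler: str = "") -> str:
    """Build a page from several backends, degrading gracefully.

    `sections` maps a hypothetical section name to a function that fetches
    that section's content. If a fetch raises, the section is replaced with
    `filler` (generic content or blank space) instead of failing the page.
    """
    parts = []
    for name, fetch in sections.items():
        try:
            parts.append(fetch())
        except Exception:
            # The backend for this section is not responding; substitute
            # the filler so the overall composition still succeeds.
            parts.append(filler)
    return "\n".join(parts)
```

With this approach an outage in one backing service costs the user one section of the page rather than the whole response. A real system would also apply a timeout to each fetch so that a slow backend cannot stall the composition.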
This one KPI has succeeded in improving availability at Google and at the same time
has aligned developer and operations priorities, helping them work together. It removes
operations from the “bad guy” role of having to refuse releases, and it gives developers an
incentive to balance time between adding new features and improving operational processes.
It is simple to explain and, since availability is already tightly monitored, easy to
implement. As a result, all of Google's services benefit.