Operations people are in the business of stability. They want nothing to break so they
don't get paged or otherwise have a bad day. This makes them risk averse. If they could,
they would reject a developer's request to push new releases into production. If it ain't
broke, don't fix it. The question they get the most from management is likely to be, “Why
was the system down?”
Once a system is stable, operations would prefer to reject new software releases.
However, it is culturally unacceptable to do so. Instead, rules are created to prevent
problems. They start as simple rules: no upgrades on Friday; if something goes wrong, we
shouldn't have to spend the weekend debugging it. Then Mondays are eliminated because
human errors are perceived to increase then. Then early mornings are eliminated, as are
late nights. More and more safeguards are added prior to release: 1 percent tests go from
being optional to required. Operations never says “no” directly, but enough rules
accumulate that “no” is effectively enforced.
Not to be locked out of shipping code, developers work around these rules. They hide
large amounts of untested code releases behind flag flips; they encode major features
in configuration files so that software upgrades aren't required, just new configurations.
Workarounds like these circumvent operations' approvals and do so at great risk.
This situation is not the fault of the developers or the operations teams. It is the fault of
the manager who decreed that any outage is bad. One hundred percent uptime is for
pacemakers, not web sites. The typical user is connecting to the web site via WiFi, which has
an availability of 99 percent, possibly less. That unreliability dwarfs any goal of perfection
demanded from on high.
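To make the WiFi comparison concrete, here is a minimal sketch of the arithmetic. It assumes the user's connection and the web site fail independently and that both must be up for a request to succeed; that model and the function name are illustrative assumptions, not something the text specifies.

```python
def end_to_end_availability(*components: float) -> float:
    """Availability seen by the user when every component must be up.

    Assumption: component failures are independent, so the
    availabilities simply multiply. Real failures may be correlated.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


WIFI = 0.99  # the figure quoted in the text

# A "four 9s" site versus a hypothetical perfect site, both reached over WiFi.
print(f"{end_to_end_availability(WIFI, 0.9999):.6f}")  # 0.989901
print(f"{end_to_end_availability(WIFI, 1.0):.6f}")     # 0.990000
```

Under these assumptions, moving the site from 99.99 percent to perfect uptime changes what the user experiences by roughly one hundredth of a percentage point, while the WiFi link alone costs a full percentage point.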
19.4.2 A Unified Goal
Typically four 9s (99.99 percent) availability is sufficient for a web site. That leaves a
“budget” of 0.01 percent downtime, a bit less than an hour each year (52.56 minutes).
Thus the Google Error Budget was created. Rather than seeking perfect uptime, a certain
amount of imperfection is budgeted for each quarter. Without permission to fail, innovation
is stifled. The Error Budget encourages risk taking without encouraging carelessness.
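The downtime figure for four 9s can be checked with a few lines of arithmetic. This is only a sketch; the function name is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: float) -> float:
    """Minutes of allowed downtime over a period for a target availability."""
    return (1.0 - availability) * days * MINUTES_PER_DAY


# Four 9s over a 365-day year.
print(f"{downtime_budget_minutes(0.9999, 365):.2f}")  # 52.56

# The same target over a 90-day quarter.
print(f"{downtime_budget_minutes(0.9999, 90):.2f}")   # 12.96
```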
At the start of each quarter, the budget is reset to 13 minutes, which is about 0.01 percent
of 90 days. Any unavailability subtracts from the budget. If the budget has not been
exhausted, developers may release as often as they want. When the budget is exhausted, all
launches stop. An exception is made for high-priority security fixes. The releases begin
again when the counter resets and there is once again a 13-minute budget in place.
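The policy described above can be sketched as a small piece of bookkeeping. The class and method names below are hypothetical, and the text does not describe how the budget is actually tracked in practice; the sketch only encodes the stated rules: outages subtract from a 13-minute budget, launches stop when it is exhausted, security fixes are exempt, and the counter resets each quarter.

```python
class ErrorBudget:
    """Illustrative quarterly error budget (names are hypothetical)."""

    def __init__(self, quarterly_budget_minutes: float = 13.0) -> None:
        self.quarterly_budget_minutes = quarterly_budget_minutes
        self.remaining_minutes = quarterly_budget_minutes

    def record_outage(self, minutes: float) -> None:
        # Any unavailability subtracts from the budget.
        self.remaining_minutes -= minutes

    @property
    def exhausted(self) -> bool:
        # Assumption: a budget of exactly zero counts as exhausted.
        return self.remaining_minutes <= 0

    def may_launch(self, security_fix: bool = False) -> bool:
        # High-priority security fixes are the one exception to a freeze.
        return security_fix or not self.exhausted

    def start_new_quarter(self) -> None:
        # The counter resets and launches may resume.
        self.remaining_minutes = self.quarterly_budget_minutes


budget = ErrorBudget()
budget.record_outage(5)
print(budget.may_launch())                    # True: 8 minutes remain
budget.record_outage(8)
print(budget.may_launch())                    # False: budget exhausted
print(budget.may_launch(security_fix=True))   # True: security exception
budget.start_new_quarter()
print(budget.may_launch())                    # True: budget reset
```

The point of the design is that the launch decision falls out of a number both teams agreed to in advance, rather than out of a case-by-case judgment call.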
As a result, operations is no longer put into the position of having to decide whether to
permit a launch. Being in such a position makes them “the bad guys” every time they say
no, and leads developers to think of them as “the enemy to be defeated.” More importantly,