Chapter 7. Operations in a Distributed World
The rate at which organizations learn may soon become the only sustainable
source of competitive advantage.
—Peter Senge
Part I of this topic discussed how to build distributed systems. Now we discuss how to run
such systems.
The work done to keep a system running is called operations. More specifically, operations
is the work done to keep a system running in a way that meets or exceeds operating
parameters specified by a service level agreement (SLA). Operations includes all aspects
of a service's life cycle, from initial launch to final decommissioning and everything in
between.
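To make "meets or exceeds operating parameters" concrete, here is a minimal sketch of checking measured behavior against an SLA. The `SlaTarget` type, the `meets_sla` function, and the thresholds (99.9% availability, 300 ms latency) are assumptions chosen for illustration. They do not come from any particular SLA, and real agreements often define more parameters and different measurement windows.

```python
from dataclasses import dataclass


@dataclass
class SlaTarget:
    """Hypothetical SLA parameters; names and thresholds are illustrative only."""
    min_availability: float  # fraction of requests that must succeed, e.g. 0.999
    max_latency_ms: float    # upper bound on the measured latency, in milliseconds


def meets_sla(successes: int, total: int, latency_ms: float, target: SlaTarget) -> bool:
    """Return True if the measured values meet or exceed the SLA targets."""
    if total == 0:
        # Assumption: with no traffic in the window, nothing was violated.
        return True
    availability = successes / total
    return (availability >= target.min_availability
            and latency_ms <= target.max_latency_ms)


# Assumed targets: 99.9% of requests succeed, latency at or below 300 ms.
target = SlaTarget(min_availability=0.999, max_latency_ms=300.0)

ok = meets_sla(999_500, 1_000_000, 250.0, target)   # 99.95% availability
bad = meets_sla(998_000, 1_000_000, 250.0, target)  # 99.80% availability
```

In this sketch the first measurement window satisfies both targets, while the second falls short on availability even though its latency is acceptable.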
Operational work tends to focus on availability, speed and performance, security, capacity
planning, and software/hardware upgrades. The failure to do any of these well results
in a system that is unreliable. If a service is slow, users will assume it is broken. If a system
is insecure, outsiders can take it down. Without proper capacity planning, it will become
overloaded and fail. Upgrades, done badly, result in downtime. If upgrades aren't done at
all, bugs will go unfixed. Because all of these activities ultimately affect the reliability of the
system, Google calls its operations team Site Reliability Engineering (SRE). Many companies
have followed suit.
Operations is a team sport. Operations is not done by a single person but rather by a team
of people working together. For that reason much of what we describe will be processes and
policies that help you work as a team, not as a group of individuals. In some companies,
processes seem to be bureaucratic mazes that slow things down. As we describe here—and
more important, in our professional experience—good processes are exactly what makes it
possible to run very large computing systems. In other words, process is what makes it possible
for teams to do the right thing, again and again.