enough that each catches no more than two outage events. As a result, each member has
enough time to follow through on each event, performing the required long-term solution.
Managing oncall this way is the topic of Chapter 14.
Other companies have adopted the SRE job title for their system administrators who
maintain live production services. Each company applies a different set of practices to the
role. These are the practices that define SRE at Google and are core to its success.
7.1.4 Operations at Scale
Operations in distributed computing is operations at a large scale. Distributed computing
involves hundreds and often thousands of computers working together. As a result,
operations is different from traditional computing administration.
Manual processes do not scale. When tasks are manual, if there are twice as many tasks,
there is twice as much human effort required. A system that is scaling to thousands of
machines, servers, or processes, therefore, becomes untenable if a process involves manually
manipulating things. In contrast, automation does scale. Code written once can be used
thousands of times. Processes that involve many machines, processes, servers, or services
should be automated. This idea applies to allocating machines, configuring operating
systems, installing software, and watching for trouble. Automation is not a “nice to have” but
a “must have.” (Automation is the subject of Chapter 12.)
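To make the point that code written once can be used thousands of times more concrete, here is a minimal Python sketch. It is only an illustration: configure_host and configure_fleet are hypothetical names, the hostnames are invented, and the per-host task is a placeholder that does no real work.

```python
from concurrent.futures import ThreadPoolExecutor


def configure_host(hostname: str) -> str:
    """Hypothetical stand-in for one automated task (e.g. installing software).

    A real version would call a configuration management tool or a remote
    API; this placeholder only returns a status string so the sketch runs
    anywhere.
    """
    return f"{hostname}: configured"


def configure_fleet(hostnames: list[str], workers: int = 32) -> list[str]:
    """Apply the same task to every host; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(configure_host, hostnames))


if __name__ == "__main__":
    # Invented hostnames: the same code handles 2 hosts or 2,000.
    fleet = [f"web{i:04d}.example.com" for i in range(2000)]
    results = configure_fleet(fleet)
    print(len(results), "hosts processed")
```

Doubling the size of the fleet changes only the length of the list passed in, not the amount of human effort, which is the sense in which automation scales and manual work does not.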
When operations is automated, system administration is more like an assembly line than
a craft. The job of the system administrator changes from being the person who does the
work to the person who maintains the robotics of an assembly line. Mass production
techniques become viable and we can borrow operational practices from manufacturing. For
example, by collecting measurements from every stage of production, we can apply
statistical analysis that helps us improve system throughput. Manufacturing techniques such as
continuous improvement are the basis for the Three Ways of DevOps. (See Section 8.2.)
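As a rough illustration of collecting measurements from every stage and analyzing them, the following Python sketch summarizes per-stage durations and picks out the slowest stage. The stage names and numbers are invented for the example, and summarize and slowest_stage are hypothetical helpers, not part of any particular tool.

```python
import statistics

# Hypothetical durations, in seconds, recorded for each stage of a pipeline.
measurements = {
    "build": [120, 118, 125, 122, 119],
    "test": [300, 310, 295, 480, 305],
    "deploy": [60, 62, 58, 61, 59],
}


def summarize(samples):
    """Return basic statistics for one stage's samples."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),  # sample standard deviation
    }


def slowest_stage(data):
    """Return the stage with the highest mean duration."""
    return max(data, key=lambda stage: statistics.mean(data[stage]))


if __name__ == "__main__":
    for stage, samples in measurements.items():
        print(stage, summarize(samples))
    print("Slowest stage on average:", slowest_stage(measurements))
```

In this invented data set, a mean well above the median for one stage hints at an occasional slow run, which is the kind of signal that suggests where to look first when trying to improve throughput.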
Three categories of things are not automated: things that should be automated but have
not been yet, things that are not worth automating, and human processes that can't be
automated.
Tasks That Are Not Yet Automated
It takes time to create, test, and deploy automation, so there will always be things that are
waiting to be automated. There is never enough time to automate everything, so we must
prioritize and choose our methods wisely. (See Section 2.2.2 and Section 12.1.1.)
For processes that are not, or have not yet been, automated, creating procedural
documentation, called a playbook, helps make the process repeatable and consistent. A good
playbook makes it easier to automate the process in the future. Often the most difficult part