Operations in a Distributed World - The Practice of Cloud System Administration


enough that each catches no more than two outage events. As a result, each member has

enough time to follow through on each event, performing the required long-term solution.

Managing oncall this way is the topic of Chapter 14.

Other companies have adopted the SRE job title for their system administrators who

maintain live production services. Each company applies a different set of practices to the

role. These are the practices that define SRE at Google and are core to its success.

7.1.4 Operations at Scale

Operations in distributed computing is operations at a large scale. Distributed computing

involves hundreds and often thousands of computers working together. As a result,

operations is different from traditional computing administration.

Manual processes do not scale. When tasks are manual, if there are twice as many tasks,

there is twice as much human effort required. A system that is scaling to thousands of

machines, servers, or processes, therefore, becomes untenable if a process involves manually

manipulating things. In contrast, automation does scale. Code written once can be used

thousands of times. Processes that involve many machines, processes, servers, or services

should be automated. This idea applies to allocating machines, configuring operating

systems, installing software, and watching for trouble. Automation is not a “nice to have” but

a “must have.” (Automation is the subject of Chapter 12.)
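
The scaling argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not a model from the book: the function names and all the numbers (10 minutes per manual task, 480 minutes to build the automation) are hypothetical, chosen only to show that manual effort grows linearly with the number of tasks while automated effort is dominated by a one-time cost.

```python
import math


def manual_effort(tasks: int, minutes_per_task: float) -> float:
    """Human minutes when every task is done by hand: grows linearly."""
    return tasks * minutes_per_task


def automated_effort(tasks: int, build_minutes: float,
                     minutes_per_run: float = 0.0) -> float:
    """Human minutes when the task is automated: a one-time build cost
    plus an (assumed small) per-run supervision cost."""
    return build_minutes + tasks * minutes_per_run


def break_even_tasks(minutes_per_task: float, build_minutes: float,
                     minutes_per_run: float = 0.0) -> int:
    """Smallest task count at which automation costs no more than manual work."""
    saved_per_task = minutes_per_task - minutes_per_run
    if saved_per_task <= 0:
        raise ValueError("automation must save time per task to break even")
    return math.ceil(build_minutes / saved_per_task)


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    for tasks in (10, 100, 1000, 2000):
        print(tasks,
              manual_effort(tasks, minutes_per_task=10),
              automated_effort(tasks, build_minutes=480))
    print("break-even at", break_even_tasks(10, 480), "tasks")
```

Under these assumed figures, doubling the task count doubles the manual cost but leaves the automated cost unchanged. The break-even point is the number of tasks beyond which the automation has paid for itself.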

When operations is automated, system administration is more like an assembly line than

a craft. The job of the system administrator changes from being the person who does the

work to the person who maintains the robotics of an assembly line. Mass production

techniques become viable and we can borrow operational practices from manufacturing. For

example, by collecting measurements from every stage of production, we can apply

statistical analysis that helps us improve system throughput. Manufacturing techniques such as

continuous improvement are the basis for the Three Ways of DevOps. (See Section 8.2.)
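
As a rough illustration of measuring every stage of production, the sketch below summarizes timing samples per stage and points at the slowest one. The stage names and durations are hypothetical, and taking the highest mean as the bottleneck is only one simple choice of statistic. A real analysis would likely look at variation and queueing as well.

```python
from statistics import mean, stdev


def summarize(samples: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Return the mean and sample standard deviation of durations per stage."""
    summary = {}
    for stage, durations in samples.items():
        summary[stage] = {
            "mean": mean(durations),
            # stdev needs at least two data points
            "stdev": stdev(durations) if len(durations) > 1 else 0.0,
        }
    return summary


def bottleneck(samples: dict[str, list[float]]) -> str:
    """Name the stage with the highest mean duration."""
    summary = summarize(samples)
    return max(summary, key=lambda stage: summary[stage]["mean"])


if __name__ == "__main__":
    # Hypothetical durations in seconds for three provisioning stages.
    samples = {
        "allocate_machine": [30.0, 32.0, 31.0],
        "configure_os": [120.0, 118.0, 125.0],
        "install_software": [60.0, 61.0, 59.0],
    }
    for stage, stats in summarize(samples).items():
        print(f"{stage}: mean={stats['mean']:.1f}s stdev={stats['stdev']:.1f}s")
    print("slowest stage:", bottleneck(samples))
```

Improving the slowest stage first is the kind of decision that per-stage measurement makes possible.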

Three categories of things are not automated: things that should be automated but have

not been yet, things that are not worth automating, and human processes that can't be

automated.

Tasks That Are Not Yet Automated

It takes time to create, test, and deploy automation, so there will always be things that are

waiting to be automated. There is never enough time to automate everything, so we must

prioritize and choose our methods wisely. (See Section 2.2.2 and Section 12.1.1.)

For processes that are not, or have not yet been, automated, creating procedural

documentation, called a playbook, helps make the process repeatable and consistent. A good

playbook makes it easier to automate the process in the future. Often the most difficult part
