Upgrading Live Services - The Practice of Cloud System Administration

• Manual Stop: There should be a list of people who can, with the click of a button, halt automated pushes. This is akin to assembly lines where anyone can halt production if a defect is found. The reason for a manual pause does not have to be an emergency.

• Push Conflicts: A service may be made up of many subservices, each on its own release schedule. It can be prudent to permit only one subservice deployment at a time. Similarly, the next push should not start if the current one hasn't finished.

• Intentional Delays: It can be useful to have a pause between pushes to let the current one “soak” (that is, to run long enough to verify that it is stable). Doing pushes too rapidly may make it difficult to isolate when a problem began.

• Resource Contention: Pushes should be paused if resources are low; for example, if disk space is low or there is unusually high CPU utilization. Load must be below a particular threshold: don't push when the system is flooded. Sufficient redundancy must exist: don't push if redundancy is below N + 2.

It might seem risky to turn these hunches into automated processes. The truth is that it is safer to have them automated and always done than to let a person decide to veto a push because he or she has a gut feeling. Operations should be based on data and science. Automating these checks means they are executed every time, consistently, no matter who is on vacation. They can be fine-tuned and improved. Many times we've heard people comment that an outage happened because they lazily decided not to do a certain check. The truth is that automated checks can improve safety.
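The checks listed above (manual stop, push conflicts, intentional delays, resource contention) might be combined into a single gate that runs before every push. The following is a minimal sketch, not a definitive implementation: the names PushState, GateLimits, blocking_reasons, and may_push are hypothetical, and every threshold is an illustrative assumption rather than a recommendation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PushState:
    """Snapshot of the facts the gate needs. All field names are hypothetical."""
    manual_stop: bool               # someone clicked the "halt pushes" button
    push_in_progress: bool          # another push (any subservice) has not finished
    seconds_since_last_push: float  # how long the current release has soaked
    disk_free_fraction: float       # 0.0 .. 1.0
    cpu_utilization: float          # 0.0 .. 1.0
    load_fraction: float            # current load divided by capacity
    healthy_replicas: int
    required_replicas: int          # N: replicas needed to carry the load


@dataclass
class GateLimits:
    """Illustrative thresholds only (assumptions, not values from the text)."""
    soak_seconds: float = 3600.0
    min_disk_free: float = 0.20
    max_cpu: float = 0.80
    max_load: float = 0.70
    spare_replicas: int = 2         # the "N + 2" rule


def blocking_reasons(state: PushState,
                     limits: Optional[GateLimits] = None) -> List[str]:
    """Return every reason the next push must wait; an empty list means go."""
    limits = limits or GateLimits()
    reasons = []
    if state.manual_stop:
        reasons.append("manual stop is set")
    if state.push_in_progress:
        reasons.append("another push is still running")
    if state.seconds_since_last_push < limits.soak_seconds:
        reasons.append("previous push is still soaking")
    if state.disk_free_fraction < limits.min_disk_free:
        reasons.append("disk space is low")
    if state.cpu_utilization > limits.max_cpu:
        reasons.append("CPU utilization is high")
    if state.load_fraction > limits.max_load:
        reasons.append("load is above threshold")
    if state.healthy_replicas < state.required_replicas + limits.spare_replicas:
        reasons.append("redundancy is below N + 2")
    return reasons


def may_push(state: PushState, limits: Optional[GateLimits] = None) -> bool:
    """True only when no check objects to the push."""
    return not blocking_reasons(state, limits)
```

Returning the full list of reasons, rather than stopping at the first failure, lets the push system log exactly why a release was held, which makes the thresholds easier to fine-tune over time.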

In practice, humans aren't any better at catching regressions than automated tests. In fact, to think otherwise is absurd. For example, one regression that people usually watch for is a service that requires significantly more RAM or CPU than the previous release. This is often an indication of a memory leak or other coding error. People often miss such issues even when automated systems detect them and provide warnings. Continuous deployment, when properly implemented, will not ignore such warnings.
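The RAM and CPU comparison described above can be sketched as a simple automated check. This is an illustration under stated assumptions: the function name resource_regressions, the metric names, and the 10 percent tolerance standing in for "significantly more" are all hypothetical choices, not values from the text.

```python
from typing import Dict, List


def resource_regressions(baseline: Dict[str, float],
                         candidate: Dict[str, float],
                         tolerance: float = 0.10) -> List[str]:
    """Name each metric where the candidate release uses more than
    `tolerance` (as a fraction) above the previous release's value.

    Metrics missing from the candidate, or with a non-positive baseline,
    are skipped rather than guessed at.
    """
    regressions = []
    for name, old in baseline.items():
        new = candidate.get(name)
        if new is None or old <= 0:
            continue
        if (new - old) / old > tolerance:
            regressions.append(name)
    return regressions


# Hypothetical measurements for the previous and the new release.
previous = {"rss_mb": 512.0, "cpu_seconds": 40.0}
current = {"rss_mb": 700.0, "cpu_seconds": 41.0}
flagged = resource_regressions(previous, current)  # only "rss_mb" exceeds 10%
```

A deployment pipeline that treats a non-empty result as a failed check cannot overlook the warning the way a person might.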

Implementing continuous deployment is nontrivial, and easier if done at the start of new projects when the project is small. Alternatively, one can adopt a policy of using continuous deployment for any new subsystems. Older systems eventually are retired or can have continuous deployment added if they stick around.
