Upgrading Live Services - The Practice of Cloud System Administration

11.11 Dealing with Failed Code Pushes

Despite all our testing and rigor, sometimes code pushed into production fails. Sometimes

it is a hard failure where the software refuses to start or fails soon after starting. In theory,

our canarying should detect hard failures and simply take the lone replica out of production.

Unfortunately, not all services are replicated, or they are not replicated in a way that makes

canarying possible. At other times the failure is more subtle. Features may, for example, fail

catastrophically in a way that is not noticed right away and cannot be mitigated through flags

or other techniques. As a result, we must change the software itself.

One method is to roll back to the last known good release. When problems are found,

the software is uninstalled and the most recent good release is reinstalled.

Another method is to roll forward to the next release, which presumably fixes the problem

discovered in the failed release. The problem with this technique is that the next release

might be hours or days away. The failed release must be resilient enough to be usable

for the duration, or workarounds must be available. The resilience techniques discussed in

Chapter 6 can reduce the time pressure involved. Teams wishing to adopt this technique

need to focus on reducing SDP code lead time until it is short enough to make roll forward

viable.

Roll forward works best when servers are highly replicated and canarying is used for

deployments. A catastrophic failure, such as the server not starting, should be found in the

test environment. If for some reason it was not, the first canary would fail, thus preventing

the other replicas from being upgraded. There would be a slight reduction in capacity until

a working release is deployed.
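
The canary behavior described above can be sketched in a few lines. This is a minimal illustration, not a production deployment tool: the names `canary_rollout` and `RolloutResult`, and the `upgrade` and `is_healthy` callbacks, are hypothetical stand-ins for whatever a site's deployment system provides. It upgrades one replica at a time and halts at the first replica that fails its health check, so the remaining replicas keep running the old release.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    upgraded: List[str] = field(default_factory=list)   # passed the health check
    failed: Optional[str] = None                        # the canary that failed, if any
    untouched: List[str] = field(default_factory=list)  # still on the old release


def canary_rollout(replicas: List[str],
                   upgrade: Callable[[str], None],
                   is_healthy: Callable[[str], bool]) -> RolloutResult:
    """Upgrade replicas one at a time, stopping at the first failure."""
    result = RolloutResult()
    for i, replica in enumerate(replicas):
        upgrade(replica)
        if not is_healthy(replica):
            # The canary failed: record it and leave every later replica
            # alone, so only this one replica's capacity is lost.
            result.failed = replica
            result.untouched = list(replicas[i + 1:])
            break
        result.upgraded.append(replica)
    return result
```

With three replicas and a release that refuses to start, the first replica is recorded as failed and the other two remain untouched, which corresponds to the slight reduction in capacity mentioned above.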

Critics of roll back point out that true roll back is impossible. Uninstalling software and

reinstalling a known good release is still a change. In fact, it is a change that has likely not

been tested in production. Doing untested processes in production should be avoided at all

costs. Doing it only as an emergency measure means doing a risky thing when risk is least

wanted.

Roll forward has overwhelming benefits and a continuous deployment environment creates

the confidence that makes roll forward possible and less risky. Pragmatically speaking,

sometimes roll forward is not possible. Therefore most sites use a hybrid solution: roll forward

when you can, roll back when you have to. Also, in this situation, sometimes it is

more expedient to push through a small change, one that fixes a specific, otherwise unsurmountable

problem. This emergency hotfix is risky, as it usually has not received full testing.

The emergency hotfixes and roll backs should be tracked carefully and projects should

be spawned to eliminate them.
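
The hybrid policy can be written down as a small decision function. The sketch below is illustrative only: the `Incident` fields, the `Remedy` names, and the rule of preferring a hotfix over a roll back when a small fix exists are all assumptions made for the example, not a procedure prescribed here. It does reflect two points from the text: roll forward is chosen whenever the next release will arrive while the failed release is still tolerable, and every hotfix or roll back is recorded so it can be followed up.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import List, Tuple


class Remedy(Enum):
    ROLL_FORWARD = "roll forward"
    HOTFIX = "emergency hotfix"
    ROLL_BACK = "roll back"


@dataclass
class Incident:
    name: str
    time_to_next_release: timedelta   # estimated lead time for a fixed release
    tolerable_degradation: timedelta  # how long the failed release can stay in service
    small_fix_available: bool         # a narrow, targeted change would resolve it


def choose_remedy(incident: Incident,
                  tracker: List[Tuple[str, Remedy]]) -> Remedy:
    """Roll forward when you can; otherwise hotfix or roll back, and track it."""
    if incident.time_to_next_release <= incident.tolerable_degradation:
        return Remedy.ROLL_FORWARD
    # Assumption for this sketch: try a targeted hotfix before a roll back.
    remedy = Remedy.HOTFIX if incident.small_fix_available else Remedy.ROLL_BACK
    # Hotfixes and roll backs are exceptions; log them for later follow-up.
    tracker.append((incident.name, remedy))
    return remedy
```

Reviewing the contents of `tracker` over time shows how often the exceptional paths are taken, which is the input needed for projects aimed at eliminating them.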
