11.11 Dealing with Failed Code Pushes
Despite all our testing and rigor, sometimes code pushed into production fails. Sometimes
it is a hard failure where the software refuses to start or fails soon after starting. In theory,
our canarying should detect hard failures and simply take the lone replica out of production.
Unfortunately, not all services are replicated, and some are not replicated in a way that
makes canarying possible. At other times the failure is more subtle. Features may, for
example, fail catastrophically in a way that is not noticed right away and cannot be
mitigated through flags or other techniques. As a result, we must change the software itself.
One method is to roll back to the last known good release. When problems are found,
the software is uninstalled and the most recent good release is reinstalled.
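As a minimal sketch of choosing the roll back target, assume a hypothetical release history, ordered oldest to newest, in which each entry records whether it is known to be good. The `Release` type, its `known_good` flag, and the `last_known_good` helper are illustrative names, not part of any particular deployment tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Release:
    version: str
    known_good: bool  # hypothetical flag: set once the release has proven itself


def last_known_good(history: Sequence[Release]) -> Optional[Release]:
    """Return the most recent known good release, or None if there is none.

    Assumes `history` is ordered oldest to newest.
    """
    for release in reversed(history):
        if release.known_good:
            return release
    return None
```

The uninstall and reinstall steps themselves depend on the packaging system in use and are left out of this sketch.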
Another method is to roll forward to the next release, which presumably fixes the problem
discovered in the failed release. The problem with this technique is that the next release
might be hours or days away. The failed release must be resilient enough to be usable
for the duration, or workarounds must be available. The resilience techniques discussed in
Chapter 6 can reduce the time pressure involved. Teams wishing to adopt this technique
need to focus on reducing SDP code lead time until it is short enough to make roll forward
viable.
Roll forward works best when servers are highly replicated and canarying is used for
deployments. A catastrophic failure, such as the server not starting, should be found in the
test environment. If for some reason it was not, the first canary would fail, thus preventing
the other replicas from being upgraded. There would be a slight reduction in capacity until
a working release is deployed.
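The canary behavior described above can be sketched as follows. This is an illustration under stated assumptions, not a real deployment system: the `upgrade` and `is_healthy` callables are hypothetical hooks supplied by the caller, and the first replica in the list is treated as the canary.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    upgraded: List[str]        # replicas now running the new release
    out_of_service: List[str]  # replicas removed from production


def canary_rollout(
    replicas: Sequence[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> RolloutResult:
    """Upgrade one canary first; touch the other replicas only if it is healthy."""
    if not replicas:
        return RolloutResult(upgraded=[], out_of_service=[])

    canary, rest = replicas[0], list(replicas[1:])
    upgrade(canary)
    if not is_healthy(canary):
        # The canary is taken out of production; the other replicas keep
        # running the old release, so capacity drops by one replica.
        return RolloutResult(upgraded=[], out_of_service=[canary])

    for replica in rest:
        upgrade(replica)
    return RolloutResult(upgraded=[canary, *rest], out_of_service=[])
```

With ten replicas and a failing canary, nine replicas continue serving on the old release until a working release is available.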
Critics of roll back point out that true roll back is impossible. Uninstalling software and
reinstalling a known good release is still a change. In fact, it is a change that has likely not
been tested in production. Doing untested processes in production should be avoided at all
costs. Doing it only as an emergency measure means doing a risky thing when risk is least
wanted.
Roll forward has overwhelming benefits, and a continuous deployment environment creates
the confidence that makes roll forward possible and less risky. Pragmatically speaking,
sometimes roll forward is not possible. Therefore most sites use a hybrid solution: roll
forward when you can, roll back when you have to. Also, in this situation, sometimes it is
more expedient to push through a small change, one that fixes a specific, otherwise
unsurmountable problem. This emergency hotfix is risky, as it usually has not received full
testing. Emergency hotfixes and roll backs should be tracked carefully, and projects should
be spawned to eliminate them.
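One way to picture the hybrid policy is as a small decision function plus a tally of the exceptional cases. The sketch below is only an illustration: the inputs (`hours_until_fix_release`, `tolerable_hours`, `small_fix_available`) and the preference for a hotfix over a roll back are assumptions made for the example, not rules stated in the text.

```python
from collections import Counter
from enum import Enum


class Remediation(Enum):
    ROLL_FORWARD = "roll forward"
    HOTFIX = "emergency hotfix"
    ROLL_BACK = "roll back"


def choose_remediation(
    hours_until_fix_release: float,
    tolerable_hours: float,
    small_fix_available: bool,
) -> Remediation:
    """Roll forward when you can; otherwise fall back to a hotfix or a roll back."""
    # Roll forward is viable only if the failed release can be tolerated
    # until the next release arrives.
    if hours_until_fix_release <= tolerable_hours:
        return Remediation.ROLL_FORWARD
    # Assumption for this sketch: a small targeted fix is tried before a roll back.
    if small_fix_available:
        return Remediation.HOTFIX
    return Remediation.ROLL_BACK


def record(choice: Remediation, tally: Counter) -> None:
    """Count only hotfixes and roll backs so they can be reviewed and eliminated."""
    if choice is not Remediation.ROLL_FORWARD:
        tally[choice.value] += 1
```

A rising tally is the signal to start a project aimed at the underlying cause, for example shortening code lead time so that roll forward becomes viable more often.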