Canarying Is Not a Substitute for System Testing
We've observed situations where canarying was used to test new releases on live
users. In one case it was done unintentionally—a fact that was not realized until a
major outage occurred. The SREs received a thoroughly tested package and would
canary it into their production environment. This worked fine for many years
because the test and live environments were very similar.
Over time, however, many tools were developed by the SREs for use in the
production environment. These tools were not tested by the developers' testing
system. The developers were not responsible for the tools, plus many of the
tools were considered ad hoc or temporary.
There's an old adage in engineering, “Nothing is more permanent than a
temporary solution.” Soon these tools grew and begat complex automated systems.
Yet, they were not tested by the developers' testing system. Each major release
broke the tools and the operations staff had to scurry to update them. These
problems were trivial, however, compared to what happened next.
One day a release was pushed into production and problems were not
discovered until the push was complete. Service for particular users came to a halt.
By now the hardware used in the two environments had diverged enough that
the kernel drivers and virtualization technology versions differed as well. The
result was that virtual machines running certain operating systems stopped working.
At this point the SREs realized the environments had diverged too much. They
needed to completely revamp their system testing environment to make sure it
tested the specific combination of main service release, kernel version,
virtualization framework version, and hardware that was used in production. In
addition, they needed to incorporate their tools into the repository and the
development and testing process so that they wouldn't have to scramble to fix
incompatibilities with the tools they had developed each time a release came out.
Creating a proper system testing environment, and a mechanism to keep test
and production in sync, required many months of effort.
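
One way to picture the "keep test and production in sync" mechanism is a check
that compares the four items named above for each environment and reports any
difference. The following is only a minimal sketch: the EnvFingerprint type, the
drift function, and all version strings are hypothetical names and values chosen
for illustration, not the tooling the SREs actually built.

```python
from dataclasses import dataclass, fields
from typing import Dict, Tuple


@dataclass(frozen=True)
class EnvFingerprint:
    """Hypothetical record of the combination that system testing must match."""
    service_release: str
    kernel_version: str
    virt_framework_version: str
    hardware_model: str


def drift(test: EnvFingerprint, prod: EnvFingerprint) -> Dict[str, Tuple[str, str]]:
    """Return {field: (test_value, prod_value)} for every field that differs."""
    return {
        f.name: (getattr(test, f.name), getattr(prod, f.name))
        for f in fields(EnvFingerprint)
        if getattr(test, f.name) != getattr(prod, f.name)
    }


# Illustrative values only; real fingerprints would be collected from the hosts.
test_env = EnvFingerprint("2.4.0", "5.10.0", "virt-6.1", "model-a")
prod_env = EnvFingerprint("2.4.0", "5.15.0", "virt-6.2", "model-b")

mismatches = drift(test_env, prod_env)
for name, (test_value, prod_value) in mismatches.items():
    print(f"{name}: test={test_value} prod={prod_value}")
```

A check like this could run before every push and block the release when the
returned dictionary is not empty, so that divergence is noticed long before an
outage reveals it.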
11.4 Phased Roll-outs
Another strategy is to partition users into groups that are upgraded one at a time. Each
group, or phase, is identified by its tolerance for risk.
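
The idea can be sketched in a few lines of code. This is an illustration under
stated assumptions, not a definitive implementation: the Phase type, the
phased_rollout function, and the upgrade and healthy callbacks are hypothetical
names, and both the ordering (most risk-tolerant group first) and the decision
to halt when a phase looks unhealthy are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Phase:
    """Hypothetical group of users sharing a tolerance for risk."""
    name: str
    risk_tolerance: int  # higher means more tolerant of risk
    users: List[str]


def phased_rollout(
    phases: List[Phase],
    upgrade: Callable[[str], None],
    healthy: Callable[[Phase], bool],
) -> List[str]:
    """Upgrade one phase at a time, most risk-tolerant first.

    Stops at the first phase that fails its health check and returns the
    names of the phases that completed successfully.
    """
    completed: List[str] = []
    for phase in sorted(phases, key=lambda p: p.risk_tolerance, reverse=True):
        for user in phase.users:
            upgrade(user)
        if not healthy(phase):
            break  # assumed policy: halt so later, less tolerant groups are spared
        completed.append(phase.name)
    return completed


# Illustrative data only.
phases = [
    Phase("general", 1, ["carol", "dave"]),
    Phase("early-adopters", 3, ["alice"]),
    Phase("internal", 5, ["ops-team"]),
]

upgraded: List[str] = []
done = phased_rollout(phases, upgraded.append, lambda phase: True)
```

In practice the health check would look at monitoring data for the users just
upgraded before the next, less risk-tolerant group is touched.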