Google Chubby Outage Drills
Inside Google is a global lock service called “Chubby.” It has such an excellent
reputation for reliability that other teams made the mistake of assuming it was
perfect. A small outage created big problems for teams that had written code that
assumed Chubby could not fail.
Not wanting to encourage bad coding practices, the Chubby team decided that
it would be best for the company to create intentional outages. If a month passed
without at least a few minutes of downtime, they would intentionally take Chubby
down for five minutes. The outage schedule was announced well in advance.
The first planned outage was cancelled shortly before it was intended to begin.
Many critical projects had reported that they would not be able to survive the test.
Teams were given 30 days to fix their code, but warned that there would be no
further delays. Now the Chubby team was taken seriously. The planned outages
have happened ever since.
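The drill policy described above reduces to a simple monthly decision, sketched here under stated assumptions. The function name `planned_outage` and the three-minute default threshold are hypothetical: the text says only "a few minutes" and does not describe how the Chubby team implemented the rule.

```python
from datetime import timedelta

# Planned outage length, as stated in the text.
DRILL_LENGTH = timedelta(minutes=5)


def planned_outage(downtime_this_month, threshold=timedelta(minutes=3)):
    """Return how long to take the service down on purpose this month.

    `threshold` stands in for "a few minutes"; its exact value is an
    assumption made for this sketch.
    """
    if downtime_this_month >= threshold:
        # Enough real downtime already happened; no drill is needed.
        return timedelta(0)
    return DRILL_LENGTH
```

A month with no natural downtime yields a five-minute drill, while a month that already had ten minutes of downtime yields none.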
15.3.2 Random Testing
Another strategy is to test a wide variety of potential failures. Rather than picking a specific
failover process to improve, select random machines or other failure domains, cause them
to fail, and verify the system is still running. This can be done on a scheduled day or week,
or in a continuous fashion.
For example, Netflix created many autonomous agents, each programmed to create
different kinds of outages. Each agent is called a monkey. Together they form the Netflix
Simian Army.
Chaos Monkey terminates random virtual machines. It is programmed to select different
machines with different probabilities, and some machines opt out entirely. Each hour Chaos
Monkey wakes up, picks a random machine, and terminates it.
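The selection step can be illustrated with a weighted random choice. This is a minimal sketch, not Netflix's implementation: the function `pick_victim`, the machine names, and the convention that a weight of zero means "opted out" are all assumptions made for illustration.

```python
import random


def pick_victim(machines, rng=random):
    """Pick one machine to terminate, weighted by its relative probability.

    `machines` maps a machine name to a relative weight. A weight of 0
    marks a machine that has opted out (an assumed convention).
    Returns None when every machine has opted out.
    """
    eligible = {name: weight for name, weight in machines.items() if weight > 0}
    if not eligible:
        return None
    names = list(eligible)
    weights = [eligible[name] for name in names]
    # random.choices draws with replacement according to the weights.
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical fleet: "db-1" has opted out, and "cache-1" is three
# times as likely to be chosen as "web-1".
fleet = {"web-1": 1.0, "db-1": 0.0, "cache-1": 3.0}
victim = pick_victim(fleet)
```

An hourly scheduler would call `pick_victim` once per wake-up and pass the result to whatever API actually terminates the virtual machine; that call is omitted here because the text does not describe it.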
Chaos Gorilla picks an entire datacenter and simulates either a network partition or a
total failure. This causes massive damage, such that recovery requires sophisticated control
systems to rebalance load. Therefore it is run manually as part of scheduled tests.
The Simian Army is always growing. Newer members include Latency Monkey, which
induces artificial delays in API calls to simulate service degradation. An extensive
description of the Simian Army can be found in the article “The Antifragile Organization”
(Tseitlin 2013).
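The kind of delay injection attributed to Latency Monkey can be sketched as a wrapper around an API call. This is only an illustration of the idea, assuming an in-process decorator: the name `with_latency`, its parameters, and the `get_user` function are hypothetical and are not taken from Netflix's tooling.

```python
import functools
import random
import time


def with_latency(min_delay, max_delay, probability=1.0,
                 rng=random, sleep=time.sleep):
    """Decorator that delays a call to simulate a degraded service.

    With the given probability, each call first sleeps for a random
    duration between min_delay and max_delay seconds. `rng` and `sleep`
    are injectable so the behavior can be tested without waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(rng.uniform(min_delay, max_delay))
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Record the injected delays instead of actually sleeping.
delays = []


@with_latency(0.1, 0.2, sleep=delays.append)
def get_user(user_id):
    # Stand-in for a real API call.
    return {"id": user_id}


result = get_user(7)
```

The wrapped call still returns its normal result; only its timing changes, which lets callers verify that their timeouts and retries cope with a slow dependency.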