Today, Google's DiRT process is possibly the largest such exercise in the world. By
2012 the number of teams involved had multiplied by 20, covering all SRE teams and
nearly all services.
Growing the process to this size depended on creating a culture where identifying problems
is considered a positive way to understand how the system can be improved rather
than a cause for alarm, blame, and finger-pointing. Some operations teams could not see
the benefit of testing beyond what their service delivery platform's continuous delivery
system already provided. The best predictor of a team's willingness to start participating was
whether previous failures had resulted in a search for a root cause to be fixed or a person to
be blamed. Being able to point to earlier, smaller successes gave new teams and executive
management confidence in expanding the program.
An example complex test might involve simulating an earthquake or other disaster that
makes the company headquarters unavailable. Forbid anyone at headquarters from talking
to the rest of the company. Google DiRT did this and learned that its remote sites could
continue, but the approval chain for emergency purchases (such as fuel for backup generators)
required the consent of people at the company headquarters. Such key findings are
nontechnical. Another nontechnical finding was that if all the tests leave people at headquarters
with nothing to do, they will flood the cafeteria, creating a DoS flood of the food kind.
Corporate emergency communications plans should also be tested. During most outages
people can communicate using the usual chat rooms and such. However, an emergency
communication mechanism is needed in the event of a total network failure. The first
Google DiRT exercise found that exactly one person was able to find the emergency
communication plan and show up on the correct phone bridge. Now periodic fire drills
spot-check whether everyone has the correct information with them. In a follow-up drill, more
than 100 people were able to find and execute the emergency communication plan. At that
point, Google learned that the bridge supported only 40 callers. During another drill, one
caller put the bridge on hold, making the bridge unusable due to “hold music” flooding the
bridge. A requirement to have the ability to kick someone off the bridge was identified. All
of these issues were discovered during simulated disasters. Had they been discovered
during a real emergency, it would have been a true disaster.
15.4.3 Implementation and Logistics
There are two kinds of tests. Global tests involve major events such as taking down a
datacenter and are initiated by the event planners. Team tests are initiated by individual teams.
An event may last multiple days or a single terrible day. Google schedules an entire
week but ends DiRT sessions as soon as the tests' goals have been satisfied. To keep every-