Drills like this carry a larger risk at the start because they span so many failure domains;
consequently, they should be approved by management. Convincing management of the
value of this kind of test may be difficult. Most managers want to avoid problems, not
induce them. It is important to approach the topic from the point of view of improving the
ability to respond to problems that will inevitably happen, and to highlight that the best
time to find problems is in a controlled environment and not late at night when employees
are asleep. Tie the process to business goals involving overall service uptime. Doing so
also improves the morale and confidence of team members.
15.4 Training for Organizations: Game Day/DiRT
Game Day exercises are multi-day, organization-wide disaster preparedness tests. They
involve many teams, often including non-technical teams such as communications, logistics,
and finance. Game Day exercises focus on testing complex scenarios, trying out rarely
tested interfaces between systems and teams, and identifying unknown organizational
dependencies.
Game Day exercises may involve a multi-day outage of a datacenter, a complex week
of network and system outages, or verification that secondary coverage personnel can
successfully run a service if the primary team disappeared for an extended amount of time.
Such exercises can also be used to rehearse for an upcoming event where additional load
is expected and the team's ability to handle large outages would be critical. For example,
many weeks before the 2012 election day, the Obama campaign performed three all-day
sessions where its election-day “Get Out the Vote” operation was put to the test. This is
described in detail in the article “When the Nerds Go Marching In” (Madrigal 2012).
Because of the larger scale and scope, this kind of testing can have more impact and
prevent larger outages. Of course, because of the larger scale and scope, this kind of
testing also requires more planning, more infrastructure, and a higher level of management
approval, buy-in, and support.
The organization needs to believe that the value realized through learning justifies the
cost. Game Day exercises might be a sizable engineering effort involving hundreds of
staff-days. There is a potential for real accidental outages that result in revenue loss.
Executives need to recognize that all systems will inevitably fail and that confidence is
best gained through practice, not avoidance. They should understand that it is best to have
these failures happen in a controlled environment when key people are awake. Learning
that a failover system does not work at 4 AM when key people are asleep or on vacation is
the risk to avoid.
Google's Disaster Recovery Testing (DiRT) is the company's form of Game Day exercises.
DiRT is done at a very large scale and focuses on testing interactions between teams.