one on their toes, DiRT doesn't start exactly at the announced start time but rather after a
small delay. The length of delay is kept secret.
A large event has many moving parts to coordinate. The coordinator should be someone
with both technical and project management experience, who can dedicate a sufficient
amount of time to the project. At a very large scale, coordination and planning may require
a dedicated, full-time position even though the event happens every 12 months. Much of
the year will be spent planning and coordinating the test. The remaining months are spent
reviewing outcomes and tracking organization-wide improvement.
Planning
The planning for the event begins many months ahead of time. Teams need time to decide
what should be tested, select proctors, and construct test scenarios. Proctors are responsible
for designing and executing tests. Long before the big day, they design tests by documenting
the goal of the test, a scenario, and a script that will be followed. For example, the script
might involve calling the oncall person for a service and having that individual simulate a
situation much like a Wheel of Misfortune exercise. Alternatively, the company may plan
to actively take down a system or service and observe the team's reaction. During the actual
test, the proctor is responsible for the test's execution.
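
The three parts of a test design named above (goal, scenario, script) can be captured in a small record so that incomplete plans are caught early. The following is a minimal sketch, not a format used by any particular company; the names DrillPlan and missing_parts and the field layout are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DrillPlan:
    """Hypothetical record of one test, written up long before the event."""
    goal: str                                        # what the test should demonstrate
    scenario: str                                    # the situation being simulated
    script: List[str] = field(default_factory=list)  # steps the proctor will follow
    proctor: str = ""                                # who designs and executes the test


def missing_parts(plan: DrillPlan) -> List[str]:
    """Return the names of the parts of the plan that are still empty."""
    missing = []
    if not plan.goal.strip():
        missing.append("goal")
    if not plan.scenario.strip():
        missing.append("scenario")
    if not plan.script:
        missing.append("script")
    if not plan.proctor.strip():
        missing.append("proctor")
    return missing
```

A proctor could run missing_parts on each draft before submitting it; an empty list means every part has been filled in, not that the plan is a good one.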
Knowing the event date as early as possible enables teams to schedule project work and
vacations. Teams may also use this time to do individual drills so that the event can focus
on tests that find the gaps between teams. If the team's individual processes are not well
practiced, then DiRT itself will not go well.
Prior to the first Game Day at Amazon, Jesse Robbins conducted a series of company-wide
briefings advising everyone of the upcoming test. He indicated it would be on the
scale of destroying a complete datacenter. People did not know which datacenter, which
inspired more comprehensive preparation (Robbins, Krishnan, Allspaw & Limoncelli 2012).
Risk is mitigated by having all test plans be submitted in advance for review and approval
by a cross-functional team of experts. This team checks for unreasonable risks. Tests
never done before are riskier and should be done in a sandbox environment or through
simulation. Often it is known ahead of time that certain systems are ill prepared and will
not survive the outage. Pre-fail these systems, mark them as failing the test, and do not
involve them in the event. There is nothing to be learned by involving them. These machines
should be whitelisted so they still receive service during the event. For example, if the
outage is simulated using network filtering, these machines can be excluded from the filter.
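
The pre-fail and whitelist steps amount to simple set arithmetic: pre-failed machines are recorded as failing and left out of the list of hosts the filter cuts off. The sketch below assumes hosts are identified by plain name strings; plan_outage and the result label are hypothetical, and the actual filtering mechanism (firewall rules, router ACLs, and so on) is outside its scope.

```python
from typing import Dict, Iterable, List, Tuple


def plan_outage(all_hosts: Iterable[str],
                prefailed: Iterable[str]) -> Tuple[List[str], Dict[str, str]]:
    """Split hosts into those to filter and those pre-failed.

    Returns (hosts_to_filter, results), where results marks every
    pre-failed host as failing the test without involving it in the event.
    """
    prefailed_set = set(prefailed)
    # Whitelisted (pre-failed) hosts are excluded from the filter,
    # so they keep receiving service during the event.
    hosts_to_filter = sorted(set(all_hosts) - prefailed_set)
    results = {host: "failed (pre-fail)" for host in sorted(prefailed_set)}
    return hosts_to_filter, results
```

Recording the pre-failed hosts in the same results structure as tested hosts keeps them visible in the post-event review even though they never took part.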