Disaster Preparedness - The Practice of Cloud System Administration


one on their toes, DiRT doesn't start exactly at the announced start time but rather after a small delay. The length of delay is kept secret.

A large event has many moving parts to coordinate. The coordinator should be someone with both technical and project management experience, who can dedicate a sufficient amount of time to the project. At a very large scale, coordination and planning may require a dedicated, full-time position even though the event happens every 12 months. Much of the year will be spent planning and coordinating the test. The remaining months are spent reviewing outcomes and tracking organization-wide improvement.

Planning

The planning for the event begins many months ahead of time. Teams need time to decide what should be tested, select proctors, and construct test scenarios. Proctors are responsible for designing and executing tests. Long before the big day, they design tests by documenting the goal of the test, a scenario, and a script that will be followed. For example, the script might involve calling the oncall person for a service and having that individual simulate a situation much like a Wheel of Misfortune exercise. Alternatively, the company may plan to actively take down a system or service and observe the team's reaction. During the actual test, the proctor is responsible for the test's execution.
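The three parts of a documented test (goal, scenario, script) can be captured in a small record so that incomplete plans are caught before review. The following is only a sketch under assumptions: the class name `DrillPlan`, its fields, and the sample values are hypothetical and are not taken from any DiRT tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DrillPlan:
    """Hypothetical record of one test, mirroring the parts a proctor documents."""
    proctor: str
    goal: str = ""
    scenario: str = ""
    script: List[str] = field(default_factory=list)  # ordered steps to follow

    def missing_parts(self) -> List[str]:
        """Return the names of the documented parts that are still empty."""
        parts = {"goal": self.goal, "scenario": self.scenario, "script": self.script}
        return [name for name, value in parts.items() if not value]


# Illustrative values only.
plan = DrillPlan(
    proctor="alice",
    goal="Verify the oncall person can fail over the service",
    scenario="Primary database becomes unreachable",
    script=["Call the oncall person", "Describe the symptoms", "Record each response"],
)
draft = DrillPlan(proctor="bob", goal="Observe reaction to a service takedown")

print(plan.missing_parts())   # an empty list means the plan is ready for review
print(draft.missing_parts())  # lists the parts the proctor still has to write
```

A check like this is cheap to run long before the event, when there is still time to finish the scenario and script.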

Knowing the event date as early as possible enables teams to schedule project work and vacations. Teams may also use this time to do individual drills so that the event can focus on tests that find the gaps between teams. If the team's individual processes are not well practiced, then DiRT itself will not go well.

Prior to the first Game Day at Amazon, Jesse Robbins conducted a series of company-wide briefings advising everyone of the upcoming test. He indicated it would be on the scale of destroying a complete datacenter. People did not know which datacenter, which inspired more comprehensive preparation (Robbins, Krishnan, Allspaw & Limoncelli 2012).

Risk is mitigated by having all test plans be submitted in advance for review and approval by a cross-functional team of experts. This team checks for unreasonable risks. Tests never done before are riskier and should be done in a sandbox environment or through simulation. Often it is known ahead of time that certain systems are ill prepared and will not survive the outage. Pre-fail these systems, mark them as failing the test, and do not involve them in the event. There is nothing to be learned by involving them. These machines should be whitelisted so they still receive service during the event. For example, if the outage is simulated using network filtering, these machines can be excluded from the filter.
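Excluding pre-failed machines from a network filter amounts to subtracting a whitelist from the set of addresses being cut off. The sketch below illustrates the idea using Python's standard `ipaddress` module. The function name `hosts_to_filter`, the sample network, and the sample address are assumptions made for illustration. A real event would feed the result into whatever filtering mechanism the site uses.

```python
import ipaddress
from typing import Iterable, List


def hosts_to_filter(outage_cidr: str, whitelist: Iterable[str]) -> List[str]:
    """Return the addresses to block, skipping whitelisted (pre-failed) machines.

    outage_cidr: network whose loss is being simulated (hypothetical input).
    whitelist:   addresses of machines already marked as failing the test.
    """
    network = ipaddress.ip_network(outage_cidr)
    exempt = {ipaddress.ip_address(addr) for addr in whitelist}
    # hosts() yields the usable addresses of the network, in order.
    return [str(addr) for addr in network.hosts() if addr not in exempt]


# Illustrative values: a tiny network with one pre-failed machine.
pre_failed = ["10.0.0.3"]
blocked = hosts_to_filter("10.0.0.0/29", pre_failed)
print(blocked)
```

Keeping the whitelist as explicit data also gives the review team a record of which systems were pre-failed and therefore already counted as failing the test.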
