delegates a person on the team to update status pages for XYZ Company's customers as
well as an internal status page within the company.
Meanwhile, Bob in Operations has determined that a database needs to be failed over
and replicas updated. Managing by objective, he asks Logistics for another resource to do
the replica and load balancing work. Logistics finds someone else in the company's IT staff
who has experience with the replica system and gets permission for that person to help Bob
during this outage. Bob begins the database failover and his new helper begins work on the
load balancing and replica work needed.
The new replica completes its initial copy and begins serving requests. Janet confirms
with all ICS section chiefs that the service's status has returned to normal. The event is
declared resolved and the ICS process is explicitly terminated. Janet takes the action item to
lead the postmortem effort.
15.6 Summary
To handle major outages and disasters well, we must prepare and practice. Ignorance may
be bliss, but practice makes progress. It is better to learn that a disaster recovery process is
broken by testing it in a controlled environment than to be surprised when it breaks during
an actual emergency.
To be prepared at every level, a strategy of practicing disaster recovery techniques at the
individual, team, and organization levels is required. Each level is dependent on the
competency achieved in the previous level.
Wheel of Misfortune is a game that trains individuals by talking through common, and
not so common, disaster scenarios. Fire drills are live tests performed to exercise a
particular process. Fire drills should first be performed on a process again and again by the same
people until the process works and can be performed smoothly. Then the process should be
done by each member of the team until everyone is confident in his or her ability to
perform the task.
Tests involving shutting down randomly selected machines or servers can find untested
failure scenarios. These tests can be run at designated times as a scheduled exercise, or
continuously as part of production to ensure that systems that should be resilient to failure
have not regressed.
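
As a rough illustration of the random-shutdown idea, the following Python sketch picks machines to shut down at random. It is only a sketch under stated assumptions: the Machine record, its opted_out flag, the grouping of machines, and the shutdown callback are all hypothetical names introduced here, not part of any particular tool.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    group: str               # hypothetical: the service or cluster it belongs to
    opted_out: bool = False  # hypothetical: machine is excluded from testing


def pick_victims(machines, rng, per_group=1):
    """Choose up to `per_group` random eligible machines from each group."""
    by_group = {}
    for m in machines:
        if not m.opted_out:
            by_group.setdefault(m.group, []).append(m)
    victims = []
    for group in sorted(by_group):
        candidates = by_group[group]
        # Always leave at least one eligible machine in each group untouched.
        count = min(per_group, len(candidates) - 1)
        if count > 0:
            victims.extend(rng.sample(candidates, count))
    return victims


def run_test(machines, shutdown, rng=None, dry_run=True):
    """Select victims; call `shutdown` on each one unless this is a dry run."""
    if rng is None:
        rng = random.Random()
    victims = pick_victims(machines, rng)
    if not dry_run:
        for m in victims:
            shutdown(m)
    return victims
```

Run once at a designated time, run_test acts as a scheduled exercise; called repeatedly by a scheduler, it approximates the continuous form. The dry-run default and the rule of leaving one machine per group untouched are design choices made for this sketch to limit risk.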
Game Day or DiRT exercises are organization-wide tests that find gaps in processes that
involve multiple teams. They often last multiple days and involve cutting off major systems
or datacenters.
DiRT events require a large amount of planning and coordination to reduce risk. Tests
should be approved by a central planning committee based on quality, impact, and risk.