Rosanne: Not in my book. You see, your SLA says that your service is supposed to be
able to survive two datacenter outages at the same time.
She is correct. Our company standard is to be able to survive two outages at the same
time. The reason is simple. Datacenters and services need to be able to be taken down
occasionally for planned maintenance. During this window of time, another datacenter might
go down for unplanned reasons such as a network or power outage. The ability to survive
two simultaneous outages is called N + 2 redundancy.
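The N + 2 requirement can be checked with simple arithmetic: remove the two datacenters whose loss hurts most and see whether the remaining capacity still covers peak demand. The following is a minimal sketch under stated assumptions, not the procedure used in the exercise. The datacenter names, capacity figures, and function names are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def worst_case_capacity(capacities, failures=2):
    """Smallest remaining capacity after any `failures` datacenters go down.

    `capacities` maps a datacenter name to the load it can serve
    (for example, queries per second).
    """
    if failures >= len(capacities):
        return 0  # nothing is left running
    total = sum(capacities.values())
    return min(
        total - sum(capacities[name] for name in down)
        for down in combinations(capacities, failures)
    )


def survives(capacities, peak_demand, failures=2):
    """True if peak demand still fits after the worst `failures` outages."""
    return worst_case_capacity(capacities, failures) >= peak_demand


# Hypothetical numbers: two large datacenters and two small ones.
capacities = {"us-east": 500, "europe": 450, "us-west": 150, "asia": 120}

print(worst_case_capacity(capacities))        # capacity left in the worst case
print(survives(capacities, peak_demand=250))  # demand fits in what remains
print(survives(capacities, peak_demand=300))  # demand exceeds what remains
```

With these illustrative figures, the worst case is losing the two large datacenters at once, which leaves only the combined capacity of the two small ones. A service that passes this check at peak demand meets N + 2 on capacity alone; latency targets in the SLA would need a separate check.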
Tom: So what do you want me to do?
Rosanne: Pretend the datacenter in Europe is going down for scheduled preventive
maintenance.
I follow our procedure and temporarily shut down the service in Europe. Web traffic
from our European customers distributes itself over the remaining two datacenters. Since
this is an orderly shutdown, no queries are lost.
Tom: Done!
Rosanne: Are you within the SLA?
I look at the dashboard and see that the latency has increased further. The entire service
is running on the two smaller datacenters. Each of the two down datacenters is bigger than
the combined, smaller, working datacenters, yet there is enough capacity to handle this
situation.
Tom: We're just barely within the SLA.
Rosanne: Congrats. You pass. You may bring the service up in the European datacenter.
I decide to file a bug anyway. We stayed within the SLA, but it was too close for
comfort. Certainly we can do better.
I look at my clock and see that it is almost 3 PM. I finish filling out the post-exercise
document just as the next oncall person comes online. I send her an instant message to
explain what she missed.
I also remind her to keep her office door locked. There's no telling where the zombies
might strike next.
15.5 Incident Command System
The public safety arena uses the Incident Command System to manage outages. IT
operations can adapt that process for handling operational outages. This idea was first
popularized by Brent Chapman in his talk “Incident Command for IT: What We Can Learn from
the Fire Department” (Chapman 2005). Brent has extensive experience in both IT
operations and public safety.