Failure is a normal part of operations and can occur at any level. Large systems magnify
the risk of small failures. A one-in-a-million failure is a daily occurrence if you have enough
machines.
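The arithmetic behind that claim can be sketched as follows. This is an illustration only: it assumes each machine fails independently with the same probability per day, and the function names and the fleet size of one million are hypothetical, not taken from the text.

```python
def expected_failures(units: int, p: float) -> float:
    """Expected number of failures per day across `units` machines,
    assuming each fails independently with probability `p` per day."""
    return units * p


def prob_at_least_one(units: int, p: float) -> float:
    """Probability that at least one of `units` machines fails in a day,
    under the same independence assumption."""
    return 1.0 - (1.0 - p) ** units


if __name__ == "__main__":
    fleet = 1_000_000      # hypothetical fleet size
    p = 1e-6               # a "one in a million" chance per machine per day
    print(expected_failures(fleet, p))
    print(prob_at_least_one(fleet, p))
```

Under these assumptions a fleet of one million machines expects about one such failure per day, and the chance of seeing at least one on any given day is roughly 0.63.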
Failures come from many sources. Software can fail unintentionally due to bugs or
intentionally to prevent a bad situation from getting worse. Hardware can also fail, with the
scope of the failure ranging from the smallest component to the largest network. Failure
domains can be any size: a device, a computer, a rack, a datacenter, or even an entire
company.
The amount of capacity in a system is N + M, where N is the amount of capacity used to
provide a service and M is the amount of spare capacity available, which can be used in the
event of a failure. A system that is N + 1 fault tolerant can survive one unit of failure and
remain operational.
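The N + M definition can be expressed as a small check. This is a minimal sketch, assuming capacity is counted in identical, interchangeable units; the function name and the example numbers are hypothetical.

```python
def survives(n_required: int, m_spare: int, failed_units: int) -> bool:
    """Return True if an N + M system still has at least N working units
    after `failed_units` units have failed."""
    total_units = n_required + m_spare
    remaining = total_units - failed_units
    return remaining >= n_required


if __name__ == "__main__":
    # Hypothetical service that needs 4 units and has 1 spare (N + 1).
    print(survives(n_required=4, m_spare=1, failed_units=1))  # one failure
    print(survives(n_required=4, m_spare=1, failed_units=2))  # two failures
```

With N = 4 and M = 1, one failed unit leaves four working units, so the service stays up; a second failure drops it below N.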
The most common way to route around failure is through replication of services. A
service may be replicated one or more times per failure domain to provide resilience greater
than the domain.
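One way to picture replication across failure domains is to record how many replicas live in each domain and ask whether any replica remains when a whole domain is lost. The sketch below is an assumption-laden illustration: the placement map, the domain names, and the function name are hypothetical, and it treats a single surviving replica as enough to keep the service up.

```python
from typing import Dict


def survives_domain_loss(placement: Dict[str, int], lost_domain: str) -> bool:
    """Return True if at least one replica remains after every replica in
    `lost_domain` is removed. `placement` maps domain name -> replica count."""
    remaining = sum(
        count for domain, count in placement.items() if domain != lost_domain
    )
    return remaining >= 1


if __name__ == "__main__":
    # Hypothetical placements: all replicas in one rack vs. spread over two.
    single_rack = {"rack-a": 3}
    two_racks = {"rack-a": 2, "rack-b": 1}
    print(survives_domain_loss(single_rack, "rack-a"))
    print(survives_domain_loss(two_racks, "rack-a"))
```

Three replicas in one rack do not outlive the rack, while replicas spread over two racks do, which is the sense in which replication can give resilience greater than a single domain.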
Failures can also come from external sources that overload a system, and from human
mistakes. There are countermeasures to nearly every failure imaginable. We can't
anticipate all failures, but we can plan for them, design solutions, prioritize their implementation,
and repeat the process.
Exercises
1. What are the major sources of failure in distributed computing systems?
2. What are the most common failures: software, hardware, or human? Justify your
answer.
3. Select one resiliency technique and give an example of a failure and the way in
which the resiliency technique would prevent a user-visible outage. Do this for one
technique in each of these sections: 6.5, 6.6, 6.7, and 6.8.
4. If a load balancer is being used, the system is automatically scalable and resilient.
Do you agree or disagree with this statement? Justify your answer.
5. Which resiliency techniques or technologies are in use in your environment?
6. Where would you like to add resiliency in your current environment? Describe
what you would change and which techniques you would apply.
7. In your environment, give an example of graceful degradation under load, or
explain how you would implement it if it doesn't currently exist.