Design Patterns for Resiliency - The Practice of Cloud System Administration

Failure is a normal part of operations and can occur at any level. Large systems magnify the risk of small failures. A one-in-a-million failure is a daily occurrence if you have enough machines.
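
The arithmetic behind that last claim can be sketched in a few lines. This is an illustrative calculation only: the fleet size, the number of operations per machine per day, and the assumption that failures are independent are all hypothetical values chosen for the example, not figures from the text.

```python
def expected_failures_per_day(machines, ops_per_machine_per_day, failure_probability):
    """Expected number of failures per day across the whole fleet."""
    return machines * ops_per_machine_per_day * failure_probability


def probability_of_at_least_one_failure(machines, ops_per_machine_per_day, failure_probability):
    """Chance that at least one operation fails in a day, assuming independent failures."""
    total_ops = machines * ops_per_machine_per_day
    return 1.0 - (1.0 - failure_probability) ** total_ops


if __name__ == "__main__":
    # Hypothetical fleet: 10,000 machines, 100 operations each per day,
    # each operation failing one time in a million.
    fleet = dict(machines=10_000, ops_per_machine_per_day=100, failure_probability=1e-6)
    print("expected failures per day:", expected_failures_per_day(**fleet))
    print("P(at least one failure today):", probability_of_at_least_one_failure(**fleet))
```

With these assumed numbers the fleet performs a million operations a day, so the expected number of failures is about one per day even though any single operation almost never fails.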

Failures come from many sources. Software can fail unintentionally due to bugs or intentionally to prevent a bad situation from getting worse. Hardware can also fail, with the scope of the failure ranging from the smallest component to the largest network. Failure domains can be any size: a device, a computer, a rack, a datacenter, or even an entire company.

The amount of capacity in a system is N + M, where N is the amount of capacity used to provide a service and M is the amount of spare capacity available, which can be used in the event of a failure. A system that is N + 1 fault tolerant can survive one unit of failure and remain operational.
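
The N + M definition can be expressed as a small check. This is a minimal sketch: the `Capacity` class and its field names are hypothetical, introduced only for illustration, and it assumes all units of capacity are interchangeable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    required: int  # N: units needed to provide the service
    spare: int     # M: extra units held in reserve for failures

    @property
    def total(self) -> int:
        return self.required + self.spare

    def survives(self, failed_units: int) -> bool:
        """True if the remaining units still cover the required capacity N."""
        return self.total - failed_units >= self.required


if __name__ == "__main__":
    n_plus_1 = Capacity(required=4, spare=1)  # an N + 1 system with N = 4
    print(n_plus_1.survives(1))  # one failed unit: still operational
    print(n_plus_1.survives(2))  # two failed units: below N
```

Under this model a system survives exactly as many unit failures as it has spare units, which is why an N + 1 design tolerates one failure and no more.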

The most common way to route around failure is through replication of services. A service may be replicated one or more times per failure domain to provide resilience greater than the domain.
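
One way to reason about replication and failure domains is to ask whether a service would still have enough replicas if any single domain were lost. The sketch below is an assumption-laden illustration: the function name, the rack labels, and the idea of counting a minimum number of surviving replicas are hypothetical, not a method given in the text.

```python
from collections import Counter


def survives_any_single_domain_failure(replica_domains, replicas_needed):
    """Return True if losing any one failure domain leaves enough replicas.

    replica_domains: one entry per replica, naming the failure domain
                     (device, rack, datacenter, ...) that replica lives in.
    replicas_needed: minimum number of replicas required to keep serving.
    """
    total = len(replica_domains)
    if total < replicas_needed:
        return False  # not enough replicas even with no failures
    per_domain = Counter(replica_domains)
    # Losing a domain removes every replica placed in that domain.
    return all(total - lost >= replicas_needed for lost in per_domain.values())


if __name__ == "__main__":
    # Both replicas share one rack: a rack failure takes the service down.
    print(survives_any_single_domain_failure(["rack-a", "rack-a"], 1))
    # Replicas spread over two racks: either rack can fail.
    print(survives_any_single_domain_failure(["rack-a", "rack-b"], 1))
```

The comparison shows that adding replicas inside a single domain does not help against the loss of that domain; resilience beyond a domain requires replicas outside it.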

Failures can also come from external sources that overload a system, and from human mistakes. There are countermeasures to nearly every failure imaginable. We can't anticipate all failures, but we can plan for them, design solutions, prioritize their implementation, and repeat the process.

Exercises

1. What are the major sources of failure in distributed computing systems?

2. What are the most common failures: software, hardware, or human? Justify your answer.

3. Select one resiliency technique and give an example of a failure and the way in which the resiliency technique would prevent a user-visible outage. Do this for one technique in each of these sections: 6.5, 6.6, 6.7, and 6.8.

4. If a load balancer is being used, the system is automatically scalable and resilient. Do you agree or disagree with this statement? Justify your answer.

5. Which resiliency techniques or technologies are in use in your environment?

6. Where would you like to add resiliency in your current environment? Describe what you would change and which techniques you would apply.

7. In your environment, give an example of graceful degradation under load, or explain how you would implement it if it doesn't currently exist.
