Chapter 6. Design Patterns for Resiliency
Success is not final, failure is not fatal: it is the courage to continue that counts.
—Winston Churchill
Resiliency is a system's ability to constructively deal with failures. A resilient system
detects failure and routes around it. Nonresilient systems fall down when faced with a
malfunction. This chapter is about software-based resiliency and documents the most common
techniques used.
Resiliency is important because no one goes to a web site that is down. Hardware
fails; that is a fact of life. You can buy the most reliable, expensive hardware in the world
and there will still be some failures. In a sufficiently large system, a one-in-a-million
failure is a daily occurrence.
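The arithmetic behind the "one-in-a-million" remark can be sketched as follows. This is a minimal illustration, not a model of any real system: the function names and the daily event counts are hypothetical, and failures are assumed to be independent with the same probability per event.

```python
def expected_failures(events_per_day: int, failure_probability: float) -> float:
    """Expected number of failures per day, assuming independent events."""
    return events_per_day * failure_probability


def probability_of_at_least_one(events_per_day: int, failure_probability: float) -> float:
    """Chance that at least one of the day's events fails."""
    return 1.0 - (1.0 - failure_probability) ** events_per_day


if __name__ == "__main__":
    one_in_a_million = 1e-6
    # Hypothetical daily event counts (requests, disk operations, and so on).
    for events in (1_000_000, 10_000_000, 1_000_000_000):
        print(
            events,
            expected_failures(events, one_in_a_million),
            probability_of_at_least_one(events, one_in_a_million),
        )
```

Under these assumptions, a system handling a million events per day expects about one such failure daily, and larger systems expect proportionally more.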
During the first year of a typical Google datacenter, there will be five rack-wide outages,
three router failures large enough to require diverting processing away from connected
machines, and eight scheduled network maintenance windows, half of which cause 30-minute
random connectivity losses. At the same time 1 to 5 percent of all disks will die and each
machine will crash at least twice (2 to 4 percent failure rate) (Dean 2009).
Graceful degradation, discussed previously, means software is designed to survive
failures or periods of high load by providing reduced functionality. For example, a movie
streaming service might automatically reduce video resolution to conserve bandwidth when
some of its internet connections are down or otherwise overloaded. The other strategy is
defense in depth, which means that all layers of the design detect and respond to failures.
This includes failures as small as a single process and as large as an entire datacenter.
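The movie streaming example of graceful degradation might look something like the sketch below. It is only an illustration: the bitrate ladder, its thresholds, and the function name `choose_resolution` are assumptions made for this example, not figures from any real service.

```python
from typing import Optional

# Hypothetical bitrate ladder: (label, required megabits per second), best first.
BITRATE_LADDER = [
    ("1080p", 8.0),
    ("720p", 5.0),
    ("480p", 2.5),
    ("240p", 0.7),
]


def choose_resolution(available_mbps: float) -> Optional[str]:
    """Return the best resolution the available bandwidth can sustain.

    Returns None when even the lowest rung cannot be served, so the caller
    can fall back to some other behavior (for example, an error page).
    """
    for label, required_mbps in BITRATE_LADDER:
        if available_mbps >= required_mbps:
            return label
    return None


if __name__ == "__main__":
    # As links fail or become overloaded, the service keeps streaming
    # at reduced quality instead of failing outright.
    for mbps in (10.0, 6.0, 3.0, 0.1):
        print(mbps, choose_resolution(mbps))
```

The design choice here is that reduced functionality is the default response to lost capacity, and a complete refusal happens only when no rung of the ladder fits.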
An older, more traditional strategy for achieving reliability is to reduce the chance of
failure at every place it can happen. Use the best servers and the best network equipment,
and put them in the most reliable datacenter. There will still be outages when this strategy is
pursued, but they will be rare. This is the most expensive strategy. Another strategy is to
perform a dependency analysis and verify that each system depends on high-quality parts.
Manufacturers calculate their components' reliability and publish their mean time between
failures (MTBF) ratings. By analyzing the dependencies within the system, one can predict
the MTBF for the entire system. When every part is required for the system to work, the
MTBF of the system can be no higher than that of its lowest-MTBF part.
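One common way to carry out this kind of prediction is sketched below. It rests on assumptions the text does not state: every component is required (a series dependency), components fail independently, and each has a constant failure rate, so the failure rates (1/MTBF) simply add. The function name and the MTBF figures are hypothetical.

```python
from typing import Sequence


def series_mtbf(component_mtbf_hours: Sequence[float]) -> float:
    """Estimate the MTBF of a system in which every component is required.

    Assumes independent components with constant failure rates, so the
    system failure rate is the sum of the component failure rates.
    """
    if not component_mtbf_hours:
        raise ValueError("at least one component is required")
    if any(mtbf <= 0 for mtbf in component_mtbf_hours):
        raise ValueError("MTBF values must be positive")
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbf_hours)
    return 1.0 / total_failure_rate


if __name__ == "__main__":
    # Hypothetical ratings, in hours, for three parts a server depends on.
    parts = [100_000.0, 50_000.0, 200_000.0]
    print(round(series_mtbf(parts), 2))
    print(min(parts))
```

Under these assumptions the estimate for the whole system comes out below the rating of the weakest part, which is why adding more required dependencies lowers the predicted MTBF.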