Longitudinal Studies on Hardware Failures
Google has published two longitudinal studies of hardware failures. Most studies of such failures are done in laboratory environments. Google meticulously collects component failure information on its entire fleet of machines, providing probably the best insight into actual failure patterns. Both studies are worth reading.
“Failure Trends in a Large Disk Drive Population” (Pinheiro, Weber & Barroso 2007) analyzed a large population of hard disks over many years. The authors did not find temperature or activity levels to correlate with drive failures. They found that after a single scan error was detected, drives were 39 times more likely to fail within the next 60 days. They also documented the “bathtub failure curve”: failures tend to happen either in the first month or only many years later.
“DRAM Errors in the Wild: A Large-Scale Field Study” (Schroeder, Pinheiro & Weber 2009) analyzed memory errors in a large fleet of machines in datacenters over a period of 2.5 years. The authors found that error rates were orders of magnitude higher than previously reported and were dominated by hard errors, the kind that ECC can detect but not correct. Temperature had a small effect compared to other factors.
6.6.2 Machines
Machine failures are generally the result of components that have died. If the system has
subsystems that are N + 1, a double failure results in machine death.
A machine that crashes will often come back to life if it is power cycled off and back on, typically with a delay to let the components drain. This process can be automated, although it is important that the automation be able to distinguish between not being able to reach the machine and the machine being down.
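The distinction matters because power-cycling an unreachable-but-healthy machine does harm. A minimal sketch of that decision logic, assuming two hypothetical probes (an in-band network check and an out-of-band power query, e.g. via a BMC), might look like this:

```python
# Sketch: deciding what repair automation should do with an unresponsive
# machine. The probe inputs (network reachability, out-of-band power state)
# are assumptions; a real fleet would use its own monitoring and BMC APIs.

def choose_action(answers_ping: bool, powered_on_per_bmc: bool) -> str:
    """Return the next step for the repair automation."""
    if answers_ping:
        # The machine is reachable; the alert was likely transient.
        return "leave-alone"
    if powered_on_per_bmc:
        # Powered on but not answering: probably crashed. Safe to cycle.
        return "power-cycle"
    # Neither the machine nor its management controller responds: this may
    # be a network or power-distribution problem, not a dead machine.
    return "escalate"
```

A call such as `choose_action(False, True)` returns `"power-cycle"`, while `choose_action(False, False)` escalates rather than blindly cycling power.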
If a power cycle does not revive the machine, the machine must be diagnosed, repaired, and brought back into service. Much of this can be automated, especially the reinstallation of the operating system. This topic is covered in more detail in Section 10.4.1.
Earlier we described situations where machines fail to boot up after a power outage. These problems can be discovered preemptively by periodically rebooting machines. For example, Google drains machines one by one for kernel upgrades. As a result of this practice, each machine is rebooted in a controlled way approximately every three months. This reduces the number of surprises found during power outages.
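The arithmetic behind such a rolling schedule is simple: to cycle every machine within a target period, drain and reboot a fixed slice of the fleet each day. A sketch, with illustrative numbers (the fleet size and 90-day period are assumptions, not figures from the text):

```python
# Sketch: sizing a rolling-reboot schedule. If every machine should be
# rebooted roughly every `period_days` days, this is how many machines
# must be drained and cycled per day. Numbers are illustrative only.
import math

def reboots_per_day(fleet_size: int, period_days: int = 90) -> int:
    """Machines to drain and reboot daily to cover the fleet in period_days."""
    return math.ceil(fleet_size / period_days)

# e.g. a hypothetical 10,000-machine fleet on a ~3-month cycle:
# reboots_per_day(10_000)  -> 112 machines per day
```

Rounding up with `math.ceil` guarantees the whole fleet is covered within the period rather than slightly overshooting it.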