Longitudinal Studies on Hardware Failures
Google has published two longitudinal studies of hardware failures. Most studies of such failures are done in laboratory environments. Google meticulously collects component failure information on its entire fleet of machines, providing probably the best insight into actual failure patterns. Both studies are worth reading.
“Failure Trends in a Large Disk Drive Population” (Pinheiro, Weber & Barroso 2007) analyzed a large population of hard disks over many years. The authors did not find temperature or activity levels to correlate with drive failures. They found that after a single scan error was detected, drives were 39 times more likely to fail within the next 60 days. They also documented the “bathtub failure curve”: failures tend to happen either in the first month or only many years later.
“DRAM Errors in the Wild: A Large-Scale Field Study” (Schroeder, Pinheiro & Weber 2009) analyzed memory errors in a large fleet of machines in datacenters over a period of 2.5 years. The authors found that error rates were orders of magnitude higher than previously reported and were dominated by hard errors, the kind that ECC can detect but not correct. Temperature had a small effect compared to other factors.
6.6.2 Machines
Machine failures are generally the result of components that have died. If the system has
subsystems that are N + 1, a double failure results in machine death.
A machine that crashes will often come back to life if it is power cycled off and back on, typically with a delay to let the components drain. This process can be automated, although it is important that the automation be able to distinguish between not being able to reach the machine and the machine being down.
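The distinction matters because power-cycling an unreachable-but-healthy machine does harm. A minimal sketch of that decision logic, assuming two hypothetical probes (an in-band network check and an out-of-band power query, e.g. via a BMC), might look like this:

```python
# Sketch: deciding what repair automation should do with an unresponsive
# machine. The probe inputs (network reachability, out-of-band power state)
# are assumptions; a real fleet would use its own monitoring and BMC APIs.

def choose_action(answers_ping: bool, powered_on_per_bmc: bool) -> str:
    """Return the next step for the repair automation."""
    if answers_ping:
        # The machine is reachable; the alert was likely transient.
        return "leave-alone"
    if powered_on_per_bmc:
        # Powered on but not answering: probably crashed. Safe to cycle.
        return "power-cycle"
    # Neither the machine nor its management controller responds: this may
    # be a network or power-distribution problem, not a dead machine.
    return "escalate"
```

A call such as `choose_action(False, True)` returns `"power-cycle"`, while `choose_action(False, False)` escalates rather than blindly cycling power.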
If a power cycle does not revive the machine, the machine must be diagnosed, repaired, and brought back into service. Much of this can be automated, especially the reinstallation of the operating system. This topic is covered in more detail in Section 10.4.1.
Earlier we described situations where machines fail to boot up after a power outage. These problems can be discovered preemptively by periodically rebooting machines. For example, Google drains machines one by one for kernel upgrades. As a result of this practice, each machine is rebooted in a controlled way approximately every three months. This reduces the number of surprises found during power outages.
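The arithmetic behind such a rolling schedule is simple: to cycle every machine within a target period, drain and reboot a fixed slice of the fleet each day. A sketch, with illustrative numbers (the fleet size and 90-day period are assumptions, not figures from the text):

```python
# Sketch: sizing a rolling-reboot schedule. If every machine should be
# rebooted roughly every `period_days` days, this is how many machines
# must be drained and cycled per day. Numbers are illustrative only.
import math

def reboots_per_day(fleet_size: int, period_days: int = 90) -> int:
    """Machines to drain and reboot daily to cover the fleet in period_days."""
    return math.ceil(fleet_size / period_days)

# e.g. a hypothetical 10,000-machine fleet on a ~3-month cycle:
# reboots_per_day(10_000)  -> 112 machines per day
```

Rounding up with `math.ceil` guarantees the whole fleet is covered within the period rather than slightly overshooting it.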