Fallacy Given Improvements In DRAM Dependability And The Fault Tolerance Of
WSC Systems Software, You Don't Need To Spend Extra For ECC Memory In A
WSC
Since ECC adds 8 bits to every 64 bits of DRAM, potentially you could save a ninth of the
DRAM costs by eliminating error-correcting code (ECC), especially since measurements of
DRAM had claimed failure rates of 1000 to 5000 FIT (failures per billion hours of operation)
per megabit [Tezzaron Semiconductor 2004].
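The "ninth" above follows from 8 check bits out of every 72 stored bits. As a rough sketch, the snippet below also shows where 8 check bits per 64 data bits could come from, assuming a conventional Hamming SECDED construction. That construction is an assumption made for illustration (the text only states the 8-per-64 figure), and the function name secded_check_bits is hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a Hamming SECDED code (assumed construction)."""
    r = 0
    # Single-error correction needs the smallest r with 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One additional overall parity bit provides double-error detection.
    return r + 1


data_bits = 64
check_bits = secded_check_bits(data_bits)

# Fraction of all stored bits that are check bits, i.e. the DRAM
# that could in principle be saved by dropping ECC.
ecc_fraction = check_bits / (data_bits + check_bits)

print(f"check bits per {data_bits} data bits: {check_bits}")
print(f"fraction of DRAM spent on ECC: {ecc_fraction:.3f}")
```

Under this assumption the check bits account for one-ninth of the stored bits, which matches the potential saving quoted above.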
Schroeder, Pinheiro, and Weber [2009] studied measurements of the DRAMs with ECC
protection at the majority of Google's WSCs, which was surely many hundreds of thousands
of servers, over a 2.5-year period. They found 15 to 25 times higher FIT rates than had
been published, or 25,000 to 70,000 failures per billion hours per megabit. Failures
affected more than 8% of DIMMs, and the average DIMM had 4000 correctable errors and 0.2
uncorrectable errors per year. Measured at the server, about a third experienced DRAM
errors each year, with an average of 22,000 correctable errors and 1 uncorrectable error
per year. That is, for one-third of the servers, about 2.5 memory errors are corrected
every hour, or one roughly every 24 minutes. Note that these systems used the more
powerful chipkill codes rather than the simpler SECDED codes. If the simpler scheme had
been used, the uncorrectable error rates would have been 4 to 10 times higher.
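The per-server figures in this paragraph can be converted into an hourly rate with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the SECDED range simply applies the stated 4x to 10x multiplier to the measured chipkill figure.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

# Averages quoted for the one-third of servers that saw DRAM errors.
correctable_per_server_year = 22_000
uncorrectable_per_server_year = 1

# Convert the yearly count into an hourly rate and a mean interval.
errors_per_hour = correctable_per_server_year / HOURS_PER_YEAR
minutes_between_errors = 60 / errors_per_hour

# With SECDED instead of chipkill, the text estimates 4x to 10x more
# uncorrectable errors per affected server per year.
secded_uncorrectable_range = (
    4 * uncorrectable_per_server_year,
    10 * uncorrectable_per_server_year,
)

print(f"correctable errors per hour: {errors_per_hour:.2f}")
print(f"minutes between corrections: {minutes_between_errors:.1f}")
print(f"uncorrectable errors/year with SECDED: {secded_uncorrectable_range}")
```

The result is about two and a half corrections per hour on an affected server, or one correction a little under every 24 minutes.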
In a WSC that only had parity error protection, the servers would have to reboot for each
memory parity error. If the reboot time were 5 minutes, one-third of the machines would
spend 20% of their time rebooting! Such behavior would lower the performance of the $150M
facility by about 6%. Moreover, these systems would suffer many uncorrectable errors without
operators being notified that they occurred.
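The reboot estimate can be checked the same way. This is a minimal sketch under the stated assumptions: every error that ECC would have corrected instead forces a 5-minute reboot, the error rate is the 22,000 per year quoted earlier, and only the one-third of servers that experience errors are affected. The variable names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24

errors_per_server_year = 22_000  # average for affected servers
reboot_minutes = 5               # assumed reboot time from the text
affected_fraction = 1 / 3        # share of servers that see DRAM errors

# Hours per year an affected server would spend rebooting.
reboot_hours_per_year = errors_per_server_year * reboot_minutes / 60

# Fraction of the year an affected server is unavailable.
reboot_time_fraction = reboot_hours_per_year / HOURS_PER_YEAR

# Averaged over the whole facility, including unaffected servers.
fleet_capacity_loss = affected_fraction * reboot_time_fraction

print(f"time rebooting on affected servers: {reboot_time_fraction:.1%}")
print(f"capacity lost across the facility: {fleet_capacity_loss:.1%}")
```

Affected servers come out at roughly a fifth of their time rebooting, and the facility-wide loss lands between 6% and 7%, in line with the figures above.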
In the early years, Google used DRAM that didn't even have parity protection. In 2000,
during testing before shipping the next release of the search index, it started suggesting
random documents in response to test queries [Barroso and Hölzle 2009]. The reason was a
stuck-at-zero fault in some DRAMs, which corrupted the new index. Google added consistency
checks to detect such errors in the future. As WSCs grew in size and as ECC DIMMs became
more affordable, ECC became the standard in Google WSCs. ECC has the added benefit of
making it much easier to find broken DIMMs during repair.
Such data suggest why the Fermi GPU (Chapter 4) adds ECC to its memory where its
predecessors didn't even have parity protection. Moreover, these FIT rates cast doubts on
efforts to use the Intel Atom processor in a WSC, which is attractive for its improved
power efficiency, since the 2011 chip set does not support ECC DRAM.
Fallacy Turning Off Hardware During Periods Of Low Activity Improves
Cost-performance Of A WSC
Figure 6.14 on page 454 shows that the cost of amortizing the power distribution and cooling
infrastructure is 50% higher than the entire monthly power bill. Hence, while it certainly
would save some money to compact workloads and turn off idle machines, even if you could
save half the power it would only reduce the monthly operational bill by 7%. There would
also be practical problems to overcome, since the extensive WSC monitoring infrastructure
depends on being able to poke equipment and see it respond. Another advantage of energy
proportionality and active low power modes is that they are compatible with the WSC
monitoring infrastructure, which allows a single operator to be responsible for more than
1000 servers.
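The cost argument can be sketched numerically. Figure 6.14 is not reproduced here, so the shares below are back-derived from the two statements in the text (halving power saves 7% of the monthly bill, and infrastructure amortization is 50% higher than the power bill). They are illustrative approximations, not the figure's actual entries, and the function name bill_reduction is hypothetical.

```python
# Stated in the text: saving half the power cuts the monthly bill by 7%.
bill_reduction_from_half_power = 0.07

# Implied share of the monthly bill that goes to power (an inference).
power_share = bill_reduction_from_half_power / 0.5

# Stated in the text: amortized power and cooling infrastructure costs
# 50% more than the power bill. This cost is fixed once the WSC is built.
infrastructure_share = 1.5 * power_share


def bill_reduction(power_saved_fraction: float) -> float:
    """Fraction of the monthly bill saved for a given fraction of power saved."""
    return power_share * power_saved_fraction


for saved in (0.25, 0.50, 0.75):
    print(f"save {saved:.0%} of power -> bill drops {bill_reduction(saved):.1%}")
print(f"infrastructure share that remains regardless: {infrastructure_share:.0%}")
```

Under these inferred shares, the infrastructure provisioned for peak power costs more each month than the energy itself, so switching machines off recovers only the smaller of the two costs.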