a result, software is replaced often. New features can be introduced faster and more fre-
quently. It is easy to experiment. As software gets older, it gets stronger: Bugs are fixed;
rare edge cases are handled better. Spolsky's (2004) essay, “Things You Should Never Do,”
gives many examples.
Using better hardware, by comparison, is more expensive. The initial purchase price is
higher. More reliable CPUs, components, and storage systems are much more expensive
than commodity parts. This strategy is also more expensive because you pay the extra ex-
pense with each machine as you grow. Upgrading hardware has a per-machine cost for
the hardware itself, installation labor, capital depreciation, and the disposal of old parts.
Designing hardware takes longer, so upgrades become available less frequently and it is
more difficult to experiment and try new things. As hardware gets older, it becomes more
brittle and fails more often.
6.2 Everything Malfunctions Eventually
Malfunctions are a part of every environment. They can happen at every level. For ex-
ample, they happen at the component level (chips and other electronic parts), the device
level (hard drives, motherboards, network interfaces), and the system level (computers,
network equipment, power systems). Malfunctions also occur regionally: racks lose power,
entire datacenters go offline, cities and entire regions of the world are struck with disaster.
Humans are also responsible for malfunctions ranging from typos to software bugs, from
accidentally kicking a power cable out of its socket to intentionally malicious attacks.
6.2.1 MTBF in Distributed Systems
Large systems magnify small problems. In large systems a “one in a million” problem
happens a lot. A hard drive with an MTBF of 1 million hours has a 1 in 114 chance of failing
this year. If you have 100,000 such hard disks, you can expect two to fail every day.
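The arithmetic behind these figures can be checked with a short calculation. The sketch below is only an illustration: it assumes a constant failure rate, independent drives, and the first-order approximation that the chance of failure in an interval is the interval length divided by the MTBF (reasonable when the interval is much shorter than the MTBF). The function names are hypothetical and are not taken from the text.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_failure_probability(mtbf_hours: float) -> float:
    """Approximate chance that one device fails within a year.

    Assumption: first-order approximation, hours in a year / MTBF,
    which holds when a year is much shorter than the MTBF.
    """
    return HOURS_PER_YEAR / mtbf_hours


def expected_failures_per_day(mtbf_hours: float, fleet_size: int) -> float:
    """Expected number of failures per day across a fleet of
    identical, independently failing devices (an assumption)."""
    return fleet_size * 24 / mtbf_hours


mtbf = 1_000_000  # MTBF in hours, the figure used in the text
p = annual_failure_probability(mtbf)
print(f"about 1 in {1 / p:.0f} chance of failure per drive per year")
print(f"about {expected_failures_per_day(mtbf, 100_000):.1f} failures per day")
```

Under these assumptions the per-drive figure comes out at about 1 in 114, and a fleet of 100,000 drives sees about 2.4 failures per day, which agrees with the "two every day" estimate.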
A bug in a CPU that is triggered with a probability of one in 10 million might be why
your parents' home PC crashed once in 2010. They cursed, rebooted, and didn't think of it
again. Such a bug would be hardly within the chip maker's ability to detect. That same bug
in a distributed computing system, however, would be observed frequently enough to show
up as a pattern in a crash detection and analysis system. It would be reported to the vendor,
which would be dismayed that it existed, shocked that anyone found it, and embarrassed
that it had been in the core CPU design for multiple chip generations. The vendor would
also be unlikely to give permission to have the specifics documented in a book on system
administration.
Failures cluster so that it appears as if the machines are ganging up on us. Racks of
machines trying to boot at the same time after a power outage expose marginal power supplies