Design Patterns for Resiliency - The Practice of Cloud System Administration

a result, software is replaced often. New features can be introduced faster and more frequently. It is easy to experiment. As software gets older, it gets stronger: Bugs are fixed; rare edge cases are handled better. Spolsky's (2004) essay, “Things You Should Never Do,” gives many examples.

Using better hardware, by comparison, is more expensive. The initial purchase price is higher. More reliable CPUs, components, and storage systems are much more expensive than commodity parts. This strategy is also more expensive because you pay the extra expense with each machine as you grow. Upgrading hardware has a per-machine cost for the hardware itself, installation labor, capital depreciation, and the disposal of old parts. Designing hardware takes longer, so upgrades become available less frequently and it is more difficult to experiment and try new things. As hardware gets older, it becomes more brittle and fails more often.

6.2 Everything Malfunctions Eventually

Malfunctions are a part of every environment. They can happen at every level. For example, they happen at the component level (chips and other electronic parts), the device level (hard drives, motherboards, network interfaces), and the system level (computers, network equipment, power systems). Malfunctions also occur regionally: racks lose power, entire datacenters go offline, cities and entire regions of the world are struck with disaster. Humans are also responsible for malfunctions ranging from typos to software bugs, from accidentally kicking a power cable out of its socket to intentionally malicious attacks.

6.2.1 MTBF in Distributed Systems

Large systems magnify small problems. In large systems a “one in a million” problem happens a lot. A hard drive with an MTBF of 1 million hours has a 1 in 114 chance of failing this year. If you have 100,000 such hard disks, you can expect two to fail every day.
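The arithmetic behind these figures can be checked with a short sketch. It assumes the simple linear approximation (hours in service divided by MTBF), which is reasonable only when the MTBF is much longer than the period considered; the function names are illustrative and do not come from the text.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def annual_failure_probability(mtbf_hours: float) -> float:
    """Approximate chance that one device fails within a year.

    Assumption: linear approximation (hours per year / MTBF),
    valid when the MTBF is far longer than one year.
    """
    return HOURS_PER_YEAR / mtbf_hours


def expected_failures_per_day(fleet_size: int, mtbf_hours: float) -> float:
    """Expected number of failures per day across a fleet of identical devices."""
    return fleet_size * 24 / mtbf_hours


p = annual_failure_probability(1_000_000)
print(f"about 1 in {1 / p:.0f} per year")  # about 1 in 114 per year
print(f"{expected_failures_per_day(100_000, 1_000_000):.1f} failures per day")  # 2.4 failures per day
```

With 100,000 disks the expected rate works out to roughly 2.4 failures per day, which matches the "two every day" rule of thumb.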

A bug in a CPU that is triggered with a probability of one in 10 million might be why your parents' home PC crashed once in 2010. They cursed, rebooted, and didn't think of it again. Such a bug would be hardly within the chip maker's ability to detect. That same bug in a distributed computing system, however, would be observed frequently enough to show up as a pattern in a crash detection and analysis system. It would be reported to the vendor, which would be dismayed that it existed, shocked that anyone found it, and embarrassed that it had been in the core CPU design for multiple chip generations. The vendor would also be unlikely to give permission to have the specifics documented in a book on system administration.

Failures cluster so that it appears as if the machines are ganging up on us. Racks of machines trying to boot at the same time after a power outage expose marginal power supplies
