Design Patterns for Resiliency - The Practice of Cloud System Administration


tially dangerous and prevents further crashes by not sending it to the remaining leaf servers. Using this technique, Google is able to achieve a measure of robustness in the face of difficult-to-predict programming errors as well as malicious denial-of-service attacks.

6.6 Physical Failures

Distributed systems also need to be resilient when faced with physical failures. The physical devices used in a distributed system can fail on many levels. Physical failures can range from the smallest electronic component all the way up to a country's power grid. Providing resiliency through the use of redundancy at every level is expensive and difficult to scale. You need a strategy for providing resiliency against hardware failures without adding excessive cost.

6.6.1 Parts and Components

Many components of a computer can fail. The parts whose utilization you monitor can fail, such as the CPU, the RAM, the disks, and the network interfaces. Supporting components can also fail, such as fans, power supplies, batteries, and motherboards.

Historically, when the CPU died, the entire machine was unusable. Multiprocessor computers are now quite common, however, so it is more likely that a machine can survive so long as one processor is still functioning. If the machine is already resilient in that way, we must monitor for N + 0 situations.
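The monitoring check described above can be sketched as follows. This is a minimal illustration, assuming that "N + 0" means a machine has exactly as many working units as it needs and no spare left. The function names and the idea of passing in simple counts are hypothetical, not taken from any particular monitoring tool.

```python
def spare_units(healthy: int, required: int) -> int:
    """Number of working units beyond the minimum the machine needs."""
    return healthy - required


def redundancy_label(healthy: int, required: int) -> str:
    """Describe the redundancy level as N+M, or flag that we are below N."""
    spare = spare_units(healthy, required)
    if spare < 0:
        return "below N"
    return f"N+{spare}"


def needs_alert(healthy: int, required: int) -> bool:
    """Alert when no spare remains (N + 0) or the machine is already short."""
    return spare_units(healthy, required) <= 0


if __name__ == "__main__":
    # Hypothetical two-processor machine that needs one processor to run.
    print(redundancy_label(healthy=2, required=1), needs_alert(2, 1))
    # One processor has failed: the machine still runs but has no spare.
    print(redundancy_label(healthy=1, required=1), needs_alert(1, 1))
```

The point of alerting at zero spare rather than at failure is that the machine is still serving at that moment, so the repair can be scheduled before the next fault takes it down.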

RAM

RAM often fails for strange reasons. Sometimes a slight power surge can affect RAM. Other times a single bit flips its value because a cosmic ray from another star system just happened to fly through it. Really!

Many memory systems store with each byte an additional bit (a parity bit) that enables them to detect errors, or two additional bits (error-correcting code or ECC memory) that enable them to perform error correction. This adds cost. It also drags down reliability because now there are 25 percent more bits and, therefore, the MTTF becomes 25 percent worse. (Although most of these failures are now corrected invisibly, the failures are still happening and can be detected via monitoring systems. If the failures persist, the component needs to be replaced.)

When writing to parity bit memory, the system counts how many 1 bits are in the byte and stores a 0 in the parity bit if the total is even, or a 1 if the total is odd. Anytime memory is read, the parity is checked and mismatches are reported to the operating system. This is sufficient to detect all single-bit errors, or any multiple-bit errors that do not preserve par-
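The parity scheme just described can be sketched in a few lines. This is an illustrative software model only, assuming one parity bit per 8-bit byte as in the text; real parity memory does this in hardware, and the helper names (`parity_bit`, `store`, `parity_ok`, `flip`) are hypothetical.

```python
def parity_bit(byte: int) -> int:
    """Return 0 if the byte holds an even number of 1 bits, or 1 if odd."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    return bin(byte).count("1") % 2


def store(byte: int) -> tuple[int, int]:
    """Model a write: keep the byte together with its parity bit."""
    return byte, parity_bit(byte)


def parity_ok(byte: int, stored_parity: int) -> bool:
    """Model a read: recompute parity and compare it with the stored bit."""
    return parity_bit(byte) == stored_parity


def flip(byte: int, *bit_positions: int) -> int:
    """Simulate memory errors by inverting the bits at the given positions."""
    for pos in bit_positions:
        byte ^= 1 << pos
    return byte


if __name__ == "__main__":
    data, parity = store(0b10110010)                 # four 1 bits, so parity is 0
    print(parity_ok(data, parity))                   # undamaged byte passes
    print(parity_ok(flip(data, 3), parity))          # one flipped bit is detected
    print(parity_ok(flip(data, 3, 5), parity))       # two flipped bits go unnoticed
```

Flipping a single bit always changes whether the count of 1 bits is even or odd, which is why every single-bit error is caught. Flipping two bits restores the original evenness, so that damaged byte passes the check.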
