tially dangerous and prevents further crashes by not sending it to the remaining leaf servers.
Using this technique Google is able to achieve a measure of robustness in the face of
difficult-to-predict programming errors as well as malicious denial-of-service attacks.
6.6 Physical Failures
Distributed systems also need to be resilient when faced with physical failures. The physical
devices used in a distributed system can fail on many levels. Physical failures can range
from the smallest electronic component all the way up to a country's power grid. Providing
resiliency through the use of redundancy at every level is expensive and difficult to scale.
You need a strategy for providing resiliency against hardware failures without adding
excessive cost.
6.6.1 Parts and Components
Many components of a computer can fail. The parts whose utilization you monitor can fail,
such as the CPU, the RAM, the disks, and the network interfaces. Supporting components
can also fail, such as fans, power supplies, batteries, and motherboards.
Historically, when the CPU died, the entire machine was unusable. Multiprocessor computers
are now quite common, however, so it is more likely that a machine can survive so
long as one processor is still functioning. If the machine is already resilient in that way, we
must monitor for N + 0 situations.
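One way to picture monitoring for N + 0 is a check that compares the number of healthy units against the number required. The sketch below is only an illustration. It assumes N is the number of units needed to carry the load and the second term counts the spares that remain. The function names are hypothetical, and how the count of healthy processors is collected is left out.

```python
def redundancy_level(healthy_units: int, required_units: int) -> str:
    """Describe the remaining redundancy in "N + M" form.

    required_units is N, the number of units needed to carry the load.
    The spares are M = healthy_units - required_units.
    """
    if required_units < 1:
        raise ValueError("required_units must be at least 1")
    spares = healthy_units - required_units
    if spares < 0:
        return "below N"  # not even enough units to carry the load
    return f"N + {spares}"


def needs_alert(healthy_units: int, required_units: int) -> bool:
    """Alert once no spare remains (N + 0) or capacity is already short."""
    return healthy_units <= required_units


# Hypothetical two-processor machine that needs only one processor to run.
print(redundancy_level(2, 1))  # both processors healthy
print(redundancy_level(1, 1))  # one processor has failed
print(needs_alert(1, 1))
```

With both processors healthy the machine is at N + 1. After one failure it still runs, but it is at N + 0, so the next failure takes it down. That is the condition worth alerting on.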
RAM
RAM often fails for strange reasons. Sometimes a slight power surge can affect RAM.
Other times a single bit flips its value because a cosmic ray from another star system just
happened to fly through it. Really!
Many memory systems store with each byte an additional bit (a parity bit) that enables
them to detect errors, or two additional bits (error-correcting code or ECC memory) that
enable them to perform error correction. This adds cost. It also drags down reliability
because now there are 25 percent more bits and, therefore, the MTTF becomes 25 percent
worse. (Although most of these failures are now corrected invisibly, the failures are still
happening and can be detected via monitoring systems. If the failures persist, the component
needs to be replaced.)
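To make the parity-bit idea concrete, here is a minimal Python sketch of the detection step. It assumes even parity over a single 8-bit byte: the stored bit is 0 when the byte contains an even number of 1 bits and 1 when the count is odd. The function names are illustrative only; real memory hardware does this in circuitry, not software.

```python
def parity_bit(byte: int) -> int:
    """Return 0 if the byte has an even number of 1 bits, 1 if odd."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value that fits in one byte")
    return bin(byte).count("1") % 2


def parity_ok(byte: int, stored_parity: int) -> bool:
    """Recompute parity on read and compare it with the stored bit."""
    return parity_bit(byte) == stored_parity


original = 0b10110010                # four 1 bits, so the parity bit is 0
stored = parity_bit(original)

single_flip = original ^ 0b00000100  # one bit changed
double_flip = original ^ 0b00000110  # two bits changed

print(parity_ok(original, stored))     # True: data is intact
print(parity_ok(single_flip, stored))  # False: single-bit error detected
print(parity_ok(double_flip, stored))  # True: two flips cancel out
```

The last line shows the limit of a single parity bit: an even number of flipped bits leaves the parity unchanged, so the error goes unnoticed.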
When writing to parity bit memory, the system counts how many 1 bits are in the byte
and stores a 0 in the parity bit if the total is even, or a 1 if the total is odd. Anytime memory
is read, the parity is checked and mismatches are reported to the operating system. This is
sufficient to detect all single-bit errors, or any multiple-bit errors that do not preserve par-