ECC memory uses additional check bits and Hamming code algorithms that can correct
single-bit errors and detect multiple-bit errors.
The likelihood of two or more bit errors increases the longer that values sit in memory
unread and the more RAM there is in a system.
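As an illustration of how that correction works, here is a minimal sketch, in Python, of a SECDED (single-error-correct, double-error-detect) Hamming code applied to a 4-bit value. Real ECC modules apply the same scheme to much wider words in hardware, so treat this as a conceptual model rather than any particular memory controller's implementation.

    def encode(nibble):
        """Encode 4 data bits [d1, d2, d3, d4] as Hamming(7,4) plus an overall parity bit."""
        d1, d2, d3, d4 = nibble
        p1 = d1 ^ d2 ^ d4                     # parity over code positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4                     # parity over code positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4                     # parity over code positions 4, 5, 6, 7
        code = [p1, p2, d1, p3, d2, d3, d4]   # code word, positions 1..7
        return code + [sum(code) % 2]         # extra parity bit enables double-error detection

    def decode(word):
        """Return (data_bits, status); status is 'ok', 'corrected', or 'double-bit error'."""
        c, overall = list(word[:7]), word[7]
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3       # 1-based position of a flipped bit, 0 if none
        parity_ok = sum(c) % 2 == overall
        if syndrome == 0 and parity_ok:
            status = "ok"
        elif not parity_ok:                   # exactly one bit flipped: fixable
            if syndrome:
                c[syndrome - 1] ^= 1
            status = "corrected"
        else:                                 # two bits flipped: detected, not correctable
            status = "double-bit error"
        return [c[2], c[4], c[5], c[6]], status

    word = encode([1, 0, 1, 1])
    word[5] ^= 1                              # simulate one flipped bit in memory
    print(decode(word))                       # -> ([1, 0, 1, 1], 'corrected')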
One can save money by having no parity or ECC bits—an approach commonly used
with low-end chipsets—but then all software has to do its own checksumming and error
correction. This is slow and costly, and you or your developers probably won't do it. So
spend the money on ECC, instead.
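To see what that burden looks like in application code, the following is a minimal sketch (in Python, using a CRC) of software-level checksumming. Note that a plain checksum only detects corruption; it cannot correct it the way ECC hardware does.

    import zlib

    def store(value: bytes):
        """Keep a checksum alongside every value the application cares about."""
        return value, zlib.crc32(value)

    def load(value: bytes, checksum: int) -> bytes:
        """Re-verify the checksum on every read; corruption raises instead of propagating."""
        if zlib.crc32(value) != checksum:
            raise IOError("in-memory corruption detected")
        return value

    record = store(b"account balance: 1042")
    print(load(*record))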
Disks
Disks fail often because they have moving parts. Solid-state drives (SSDs), which have no
moving parts, wear out since each block is rated to be written only a certain number of
times.
The usual solution is to use RAID level 1 or higher to achieve N + 1 redundancy or
better. However, RAID systems are costly and their internal firmware is often a source of
frustration, as it is difficult to configure without interrupting service. (A full explanation
of RAID levels is not included here but can be found in our other book, The Practice of
System and Network Administration.)
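For the mirroring case (RAID 1), the idea can be sketched in a few lines of Python. This toy model stands in for the controller, not for any vendor's firmware: every write goes to both devices, and a read succeeds as long as either copy survives.

    class Raid1Mirror:
        """Toy RAID 1: two dictionaries stand in for two physical disks."""
        def __init__(self):
            self.disks = [{}, {}]

        def write(self, block: int, data: bytes):
            for disk in self.disks:           # duplicate every write
                disk[block] = data

        def read(self, block: int) -> bytes:
            for disk in self.disks:           # fall back to the surviving copy
                if block in disk:
                    return disk[block]
            raise IOError("both replicas lost")

    mirror = Raid1Mirror()
    mirror.write(0, b"payroll")
    mirror.disks[0].clear()                   # simulate a failed disk
    print(mirror.read(0))                     # still readable: b'payroll'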
File systems such as ZFS, Btrfs, and Hadoop HDFS store data reliably by providing
their own RAID or RAID-like functionality. In those cases hardware RAID controllers are
not needed.
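For example, HDFS achieves its redundancy through a per-file replication factor rather than a RAID controller. The fragment below is an illustrative hdfs-site.xml setting (three replicas is the common default); the exact value is a deployment choice.

    <!-- Illustrative hdfs-site.xml fragment: HDFS keeps this many copies of each
         block, spread across different data nodes, instead of relying on RAID. -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>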
We recommend the strategic use of RAID controllers, deploying them only where re-
quired. For example, a widely used distributed computing environment is the Apache Ha-
doop system. The first three machines in a Hadoop cluster are special master service ma-
chines that store critical configuration information. This information is not replicated and
is difficult to rebuild if lost. The other machines in a Hadoop cluster are data nodes that
store replicas of data. In this environment RAID is normally used on the master machines.
Implementing RAID there has a fixed cost, as no more than three machines with RAID
controllers are needed. Data nodes are added when more capacity is needed. They are built
without RAID since Hadoop replicates data, detecting failures and creating new replicas
as needed. This strategy has a cost benefit in that the expensive hardware is a fixed
quantity while the nodes used to expand the system are the inexpensive ones.
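A back-of-the-envelope calculation makes the fixed-versus-variable cost point concrete; the prices below are hypothetical placeholders, not figures from the text.

    # Hypothetical prices, for illustration only.
    MASTER_WITH_RAID = 6000   # assumed cost of a master node with a RAID controller
    DATA_NODE = 3000          # assumed cost of a plain data node without RAID

    def cluster_cost(data_nodes: int) -> int:
        return 3 * MASTER_WITH_RAID + data_nodes * DATA_NODE

    for n in (10, 100, 1000):
        # The three RAID-equipped masters are a one-time fixed cost; growth is
        # paid for entirely in inexpensive data nodes.
        print(n, cluster_cost(n))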