ECC memory uses additional check bits and Hamming code algorithms that can correct
single-bit errors and detect multiple-bit errors.
The likelihood of two or more bit errors increases the longer that values sit in memory
unread and the more RAM there is in a system.
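As an illustration of how that correction works, here is a minimal sketch, in Python, of a SECDED (single-error-correct, double-error-detect) Hamming code applied to a 4-bit value. Real ECC modules apply the same scheme to much wider words in hardware, so treat this as a conceptual model rather than any particular memory controller's implementation.

    def encode(nibble):
        """Encode 4 data bits [d1, d2, d3, d4] as Hamming(7,4) plus an overall parity bit."""
        d1, d2, d3, d4 = nibble
        p1 = d1 ^ d2 ^ d4                     # parity over code positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4                     # parity over code positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4                     # parity over code positions 4, 5, 6, 7
        code = [p1, p2, d1, p3, d2, d3, d4]   # code word, positions 1..7
        return code + [sum(code) % 2]         # extra parity bit enables double-error detection

    def decode(word):
        """Return (data_bits, status); status is 'ok', 'corrected', or 'double-bit error'."""
        c, overall = list(word[:7]), word[7]
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3       # 1-based position of a flipped bit, 0 if none
        parity_ok = sum(c) % 2 == overall
        if syndrome == 0 and parity_ok:
            status = "ok"
        elif not parity_ok:                   # exactly one bit flipped: fixable
            if syndrome:
                c[syndrome - 1] ^= 1
            status = "corrected"
        else:                                 # two bits flipped: detected, not correctable
            status = "double-bit error"
        return [c[2], c[4], c[5], c[6]], status

    word = encode([1, 0, 1, 1])
    word[5] ^= 1                              # simulate one flipped bit in memory
    print(decode(word))                       # -> ([1, 0, 1, 1], 'corrected')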
One can save money by having no parity or ECC bits—an approach commonly used
with low-end chipsets—but then all software has to do its own checksumming and error
correction. This is slow and costly, and you or your developers probably won't do it. So
spend the money on ECC, instead.
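To see what that burden looks like in application code, the following is a minimal sketch (in Python, using a CRC) of software-level checksumming. Note that a plain checksum only detects corruption; it cannot correct it the way ECC hardware does.

    import zlib

    def store(value: bytes):
        """Keep a checksum alongside every value the application cares about."""
        return value, zlib.crc32(value)

    def load(value: bytes, checksum: int) -> bytes:
        """Re-verify the checksum on every read; corruption raises instead of propagating."""
        if zlib.crc32(value) != checksum:
            raise IOError("in-memory corruption detected")
        return value

    record = store(b"account balance: 1042")
    print(load(*record))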
Disks
Disks fail often because they have moving parts. Solid-state drives (SSDs), which have no
moving parts, wear out since each block is rated to be written only a certain number of
times.
The usual solution is to use RAID level 1 or higher to achieve N + 1 redundancy or
better. However, RAID systems are costly and their internal firmware is often a source of
frustration, as it is difficult to configure without interrupting service. (A full explanation
of RAID levels is not included here but can be found in our other book, The Practice of
System and Network Administration.)
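For the mirroring case (RAID 1), the idea can be sketched in a few lines of Python. This toy model stands in for the controller, not for any vendor's firmware: every write goes to both devices, and a read succeeds as long as either copy survives.

    class Raid1Mirror:
        """Toy RAID 1: two dictionaries stand in for two physical disks."""
        def __init__(self):
            self.disks = [{}, {}]

        def write(self, block: int, data: bytes):
            for disk in self.disks:           # duplicate every write
                disk[block] = data

        def read(self, block: int) -> bytes:
            for disk in self.disks:           # fall back to the surviving copy
                if block in disk:
                    return disk[block]
            raise IOError("both replicas lost")

    mirror = Raid1Mirror()
    mirror.write(0, b"payroll")
    mirror.disks[0].clear()                   # simulate a failed disk
    print(mirror.read(0))                     # still readable: b'payroll'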
File systems such as ZFS, Btrfs, and Hadoop HDFS store data reliably by providing
their own RAID or RAID-like functionality. In those cases hardware RAID controllers are
not needed.
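For example, HDFS achieves its redundancy through a per-file replication factor rather than a RAID controller. The fragment below is an illustrative hdfs-site.xml setting (three replicas is the common default); the exact value is a deployment choice.

    <!-- Illustrative hdfs-site.xml fragment: HDFS keeps this many copies of each
         block, spread across different data nodes, instead of relying on RAID. -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>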
We recommend the strategic use of RAID controllers, deploying them only where re-
quired. For example, a widely used distributed computing environment is the Apache Ha-
doop system. The first three machines in a Hadoop cluster are special master service ma-
chines that store critical configuration information. This information is not replicated and
is difficult to rebuild if lost. The other machines in a Hadoop cluster are data nodes that
store replicas of data. In this environment RAID is normally used on the master machines.
Implementing RAID there has a fixed cost, as no more than three machines with RAID
controllers are needed. Data nodes are added when more capacity is needed. They are built
without RAID since Hadoop replicates data, detecting failures and creating new replicas
as needed. This strategy has a cost benefit in that the expensive hardware is a fixed
quantity while the nodes used to expand the system are the inexpensive ones.
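A back-of-the-envelope calculation makes the fixed-versus-variable cost point concrete; the prices below are hypothetical placeholders, not figures from the text.

    # Hypothetical prices, for illustration only.
    MASTER_WITH_RAID = 6000   # assumed cost of a master node with a RAID controller
    DATA_NODE = 3000          # assumed cost of a plain data node without RAID

    def cluster_cost(data_nodes: int) -> int:
        return 3 * MASTER_WITH_RAID + data_nodes * DATA_NODE

    for n in (10, 100, 1000):
        # The three RAID-equipped masters are a one-time fixed cost; growth is
        # paid for entirely in inexpensive data nodes.
        print(n, cluster_cost(n))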