with unrecoverable read errors, reconstructing the lost data from the remaining
disks in the RAID array, and attempting to write and read the reconstructed
data to and from the suspect sector. If the writes and reads succeed, the error
was caused by a transient fault, and the disk continues to use the sector. If
the sector cannot be accessed successfully, the error is permanent, and the
system remaps that sector to a spare and writes the reconstructed data there.
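In outline, this scrub-and-remap decision looks something like the following sketch (in Python for concreteness); the callables are hypothetical stand-ins for firmware- or RAID-layer operations, not a real disk interface.

```python
def handle_unrecoverable_read(read_sector, write_sector, remap_to_spare,
                              reconstruct_from_peers, sector):
    # Hypothetical sketch: the four callables stand in for disk/RAID-layer
    # operations rather than any real API.
    data = reconstruct_from_peers(sector)  # rebuild from the surviving disks
    write_sector(sector, data)             # try writing the suspect sector
    if read_sector(sector) == data:        # read back and compare
        return "transient"                 # sector is still usable
    spare = remap_to_spare(sector)         # retire the bad sector
    write_sector(spare, data)              # store the data on the spare
    return "permanent"
```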
Reducing nonrecoverable read error rates with more reliable disks.
Different disk models promise significantly different nonrecoverable read error
rates. In particular, in 2011, many disks aimed at laptops and personal computers
claim unrecoverable read error rates of one per 10^14 bits read, while disks
aimed at enterprise servers often have lower storage densities but can promise
unrecoverable read error rates of one per 10^16 bits read. This two-order-of-magnitude
improvement greatly reduces the probability that a RAID system loses
data from a combination of a full disk failure and a nonrecoverable read error
during recovery.
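To see why this matters, consider the probability of hitting at least one unrecoverable read error while reading an entire surviving 1 TB disk during RAID reconstruction. A back-of-envelope calculation, assuming each bit fails independently (a simplification):

```python
import math

def p_read_error(disk_bytes, per_bit_error_rate):
    # P(at least one unrecoverable error when reading the whole disk),
    # under an independent per-bit error model.
    return 1 - math.exp(-disk_bytes * 8 * per_bit_error_rate)

one_tb = 10**12
print(p_read_error(one_tb, 1e-14))  # desktop-class disk: ~0.077, about 8%
print(p_read_error(one_tb, 1e-16))  # enterprise-class disk: ~0.0008
```

Under this model, reconstruction that must read a full desktop-class disk encounters an error several percent of the time, while the enterprise-class rate cuts that risk by two orders of magnitude.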
Reducing mean time to repair with hot spares. Some systems include
"hot spare" disk drives that are idle but plugged into a server so that if one of
the server's disks fails, the hot spare can be automatically activated to replace
the lost disk.
Note that even with hot spares, the mean time to repair a disk is limited by
the time it takes to write the reconstructed data to it, and this time is often
measured in hours. For example, if we have a 1 TB disk and can write at
100 MB/s, the mean time to repair the disk will be at least 10^4 seconds, or
about 3 hours. In practice, repair time may be even longer if the achieved
bandwidth is less than assumed here.
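A quick check of this arithmetic; the 30 MB/s degraded-mode figure below is an illustrative assumption, not from the text:

```python
disk_bytes = 10**12      # 1 TB disk
write_bw = 100 * 10**6   # 100 MB/s sustained write bandwidth

repair_s = disk_bytes / write_bw
print(repair_s)          # 10000.0 seconds (10^4)
print(repair_s / 3600)   # ~2.8 hours

# If recovery achieves only, say, 30 MB/s because the disk is also serving
# foreground requests, repair stretches to roughly 9 hours.
print(disk_bytes / (30 * 10**6) / 3600)
```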
Reducing mean time to repair with declustering. Disks with hundreds
of gigabytes to a few terabytes can take hours to fully write with reconstructed
data. Declustering splits reconstruction of a failed disk across multiple disks.
Declustering thus allows reconstruction to proceed in parallel, speeding recovery
and reducing MTTR.
For example, the Hadoop File System (HDFS) is a cluster file system that
writes each data block to three out of potentially hundreds or thousands of
disks. It chooses the three disks for each block more or less randomly. If
one disk fails, it rereplicates the lost blocks approximately randomly across
the remaining disks. If we have N disks, each with bandwidth B, total
reconstruction bandwidth can approach (N/2)B, since each rereplicated byte must
be read from one surviving disk and written to another. For example, if there
are 1000 disks with 100 MB/s of bandwidth each, reconstruction bandwidth can
theoretically approach 50 GB/s, allowing rereplication of a 1 TB disk's data in
tens of seconds.
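Here is that back-of-envelope model in Python; it assumes the disks themselves are the only bottleneck, an assumption the caveats below immediately qualify.

```python
n_disks = 1000           # disks in the cluster
disk_bw = 100 * 10**6    # 100 MB/s per disk
lost_bytes = 10**12      # one failed 1 TB disk to rereplicate

# Each rereplicated byte is read from one surviving disk and written to
# another, so aggregate rereplication bandwidth is at most (N/2) * B.
recon_bw = (n_disks / 2) * disk_bw
print(recon_bw / 10**9)       # 50.0 GB/s
print(lost_bytes / recon_bw)  # 20.0 seconds
```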
In practice, rereplication will be slower than this for at least three reasons.
First, resources other than the disk (e.g., the network) may bottleneck recovery.
 