Second, the system may throttle recovery speed to avoid starving user requests.
Third, if a server crashes and its disks become inaccessible, the system may delay
starting recovery (hoping that the server will soon recover) to avoid imposing
extra load on the system.
Pitfalls
When constructing a reliable storage system, it is not enough to provide
enough redundancy to tolerate a target number of failures. We also need to
consider how failures are likely to occur (e.g., they may be correlated) and what
it takes to correct them (e.g., successfully reading a lot of other data). More
specifically, be aware of the following pitfalls:
Assuming uncorrelated failures. It is easy to get gaudy MTTDL
numbers by adding a redundant device or two and multiplying the
devices' MTTFs. But the simple equation on page 432 only applies when
failures are uncorrelated. Even a 1% chance of correlated failures
dramatically changes the estimate. Unfortunately, it is often difficult to estimate
correlation rates a priori, so designers must sometimes just add a
significant safety margin and hope that it is enough.
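To see how strongly the estimate depends on independence, consider the following rough sketch. It is not the equation from page 432. It assumes a simple single-parity group of n devices in which data is lost when a second device fails before the first is repaired, and it models correlation with a hypothetical parameter p_corr: the chance that a first failure takes a second device down with it. All function names and example numbers are illustrative assumptions.

```python
def mttdl_independent(n_devices, mttf_hours, mttr_hours):
    """Approximate MTTDL for a single-parity group, assuming independent failures.

    Assumed model: data is lost when a second device fails during the
    repair window of the first.
    """
    return mttf_hours ** 2 / (n_devices * (n_devices - 1) * mttr_hours)


def mttdl_correlated(n_devices, mttf_hours, mttr_hours, p_corr):
    """Same model, plus a hypothetical chance p_corr that a first failure
    is immediately accompanied by a second (correlated) failure."""
    # Rate at which some device in the group suffers a first failure.
    first_failure_rate = n_devices / mttf_hours
    # Chance that an independent second failure lands inside the repair window.
    p_second_in_repair = (n_devices - 1) * mttr_hours / mttf_hours
    # Data is lost if the failure is correlated, or if it is not correlated
    # but a second independent failure arrives before repair completes.
    p_loss = p_corr + (1 - p_corr) * p_second_in_repair
    return 1 / (first_failure_rate * p_loss)


if __name__ == "__main__":
    # Assumed example values: 10 devices, 1,000,000-hour MTTF, 10-hour MTTR.
    n, mttf, mttr = 10, 1_000_000, 10
    base = mttdl_independent(n, mttf, mttr)
    corr = mttdl_correlated(n, mttf, mttr, p_corr=0.01)
    print(f"independent failures:    {base:.3g} hours")
    print(f"1% correlated failures:  {corr:.3g} hours")
    print(f"reduction factor:        {base / corr:.0f}x")
```

Under these assumed numbers, a 1% chance of correlated failure cuts the estimated MTTDL by roughly two orders of magnitude, because the correlated term dwarfs the tiny probability of two independent failures overlapping.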
Ignoring the risk from latent errors. It is not uncommon to see
analyses of RAID reliability that consider full device failures but not
nonrecoverable read failures. As we have seen above, nonrecoverable read
errors can dramatically reduce the probability of successfully recovering
data after a disk failure.
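A back-of-the-envelope calculation makes the effect concrete. The sketch below assumes that nonrecoverable read errors strike each bit independently at a fixed rate, and that rebuilding a failed disk requires reading every bit on the surviving disks. The error rate, disk size, and group size are assumed example values, not figures from the text.

```python
import math


def p_rebuild_without_read_error(bits_to_read, error_rate_per_bit):
    """Probability that every bit needed for a rebuild is read successfully.

    Assumes independent per-bit errors, so the result is
    (1 - error_rate_per_bit) ** bits_to_read, computed via logs for accuracy.
    """
    return math.exp(bits_to_read * math.log1p(-error_rate_per_bit))


if __name__ == "__main__":
    surviving_disks = 9          # assumed: 10-disk group with one failed disk
    bytes_per_disk = 10 ** 12    # assumed: 1 TB per disk
    error_rate = 1e-14           # assumed: one nonrecoverable error per 10^14 bits

    bits = surviving_disks * bytes_per_disk * 8
    p_ok = p_rebuild_without_read_error(bits, error_rate)
    print(f"bits to read during rebuild: {bits:.2e}")
    print(f"chance rebuild hits no read error: {p_ok:.2f}")
```

With these assumed parameters, the rebuild has only about an even chance of completing without encountering a latent error, which an analysis based on whole-device failures alone would miss entirely.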
Not implementing scrubbing. Periodically scrubbing disks to detect
and correct latent errors can significantly reduce the risk of data loss.
Although it can be difficult to predict the appropriate scrubbing frequency
a priori, a system that uses scrubbing can monitor the rate at which
noncorrectable read errors are found and corrected and use the measured
rate to adjust the scrubbing frequency.
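One way such feedback might look is sketched below. The policy is a hypothetical illustration rather than a method described in the text: it halves the scrub interval when a pass finds more latent errors than a chosen target, lengthens it when a pass finds far fewer, and clamps the result to assumed bounds. The function name, thresholds, and bounds are all assumptions.

```python
def next_scrub_interval(current_hours, errors_found, target_errors,
                        min_hours=24.0, max_hours=24.0 * 30):
    """Pick the next scrub interval from the errors found in the last pass.

    Hypothetical policy: scrub twice as often when errors exceed the target,
    back off by 50% when errors are under half the target, else hold steady.
    """
    if errors_found > target_errors:
        proposed = current_hours / 2      # errors accumulating: scrub sooner
    elif errors_found < target_errors / 2:
        proposed = current_hours * 1.5    # few errors: scrubbing can relax
    else:
        proposed = current_hours          # near target: keep the interval
    # Clamp so the interval never leaves the assumed operating range.
    return max(min_hours, min(max_hours, proposed))


if __name__ == "__main__":
    interval = 168.0  # assumed starting point: weekly scrubs
    for errors in [10, 6, 3, 0, 0]:  # made-up per-pass error counts
        interval = next_scrub_interval(interval, errors, target_errors=4)
        print(f"found {errors:2d} errors -> next scrub in {interval:.0f} hours")
```

A real system would also need to weigh the I/O load that scrubbing imposes on user requests, which this sketch ignores.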
Not having a backup. The techniques discussed in this section can
protect a system against many, but not all, faults. For example, a widespread
correlated failure (e.g., a building burning down), an operator error (e.g.,
"rm -r *"), or a software bug could corrupt or delete data stored across
any number of redundant devices.
Definition: backup
A backup system provides storage that is separate from a system's main
storage. Ideally, the separation is both physical and logical.
Definition: physical separation
Physical separation means that backup storage devices are in different
locations than the primary storage devices. For example, some systems
achieve physical separation by copying data to tape and storing the tapes
in a different building than the main storage servers.
Other systems
 