Reliable Storage - Operating Systems: Principles and Practice

Second, the system may throttle recovery speed to avoid starving user requests.

Third, if a server crashes and its disks become inaccessible, the system may delay starting recovery, hoping that the server will soon recover, to avoid imposing extra load on the system.

Pitfalls

When constructing a reliable storage system, it is not enough to provide enough redundancy to tolerate a target number of failures. We also need to consider how failures are likely to occur (e.g., they may be correlated) and what it takes to correct them (e.g., successfully reading a lot of other data). More specifically, be aware of the following pitfalls:

Assuming uncorrelated failures. It is easy to get gaudy MTTDL numbers by adding a redundant device or two and multiplying the devices' MTTFs. But the simple equation on page 432 only applies when failures are uncorrelated. Even a 1% chance of correlated failures dramatically changes the estimate. Unfortunately, it is often difficult to estimate correlation rates a priori, so designers must sometimes just add a significant safety margin and hope that it is enough.
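
To get a feel for how much a small correlation rate matters, the following minimal sketch compares two estimates for an array that tolerates one device failure. It assumes the usual independent-failure approximation MTTF^2 / (N (N - 1) MTTR), which may or may not match the exact form of the equation referenced above, and a deliberately crude correlation model in which a fixed fraction of first failures is immediately accompanied by a second one. The function names, the model, and the device parameters are all illustrative assumptions.

```python
def mttdl_uncorrelated(n_disks, mttf_hours, mttr_hours):
    """Estimate MTTDL for a one-failure-tolerant array with independent failures."""
    return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)


def mttdl_with_correlation(n_disks, mttf_hours, mttr_hours, correlated_fraction):
    """Same array, but a fraction of first failures brings a second failure along.

    Assumed model (not from the text): with probability `correlated_fraction`
    the first failure is immediately followed by a second one, so data is lost
    regardless of how quickly the repair would have finished.
    """
    first_failure_rate = n_disks / mttf_hours
    # Chance that an independent second failure lands inside the repair window.
    p_independent = (n_disks - 1) * mttr_hours / mttf_hours
    p_loss = (1 - correlated_fraction) * p_independent + correlated_fraction
    return 1 / (first_failure_rate * p_loss)


if __name__ == "__main__":
    # Hypothetical array: 10 disks, 1,000,000-hour MTTF, 10-hour repair time.
    n, mttf, mttr = 10, 1_000_000, 10
    base = mttdl_uncorrelated(n, mttf, mttr)
    corr = mttdl_with_correlation(n, mttf, mttr, correlated_fraction=0.01)
    print(f"uncorrelated:  {base:.3g} hours")
    print(f"1% correlated: {corr:.3g} hours ({base / corr:.0f}x lower)")
```

Under these assumed numbers, the correlated term dominates the loss probability, so the estimate falls by roughly two orders of magnitude even though 99% of failures are still independent.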

Ignoring the risk from latent errors. It is not uncommon to see analyses of RAID reliability that consider full device failures but not nonrecoverable read failures. As we have seen above, nonrecoverable read errors can dramatically reduce the probability of successfully recovering data after a disk failure.
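
As a rough illustration of this effect, the sketch below estimates the chance that a rebuild reads every surviving bit without hitting a nonrecoverable read error. It assumes that errors strike bits independently at a fixed per-bit rate, and the array size, disk capacity, and error rate are hypothetical values chosen only for the example.

```python
import math


def p_rebuild_succeeds(surviving_disks, disk_bytes, bit_error_rate):
    """Probability of reading all surviving data with no nonrecoverable error.

    Assumes each bit fails independently with probability `bit_error_rate`.
    """
    bits_to_read = surviving_disks * disk_bytes * 8
    # log1p keeps precision when the per-bit error rate is tiny.
    return math.exp(bits_to_read * math.log1p(-bit_error_rate))


if __name__ == "__main__":
    # Hypothetical rebuild: 9 surviving 1 TB disks, 1 error per 10^14 bits read.
    p = p_rebuild_succeeds(surviving_disks=9, disk_bytes=10**12,
                           bit_error_rate=1e-14)
    print(f"probability the rebuild reads everything cleanly: {p:.2f}")
```

With these assumed values the rebuild has to read about 7.2 x 10^13 bits, so the chance of completing without a single read error is only about one in two, which is why an analysis that counts only whole-device failures can be badly optimistic.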

Not implementing scrubbing. Periodically scrubbing disks to detect and correct latent errors can significantly reduce the risk of data loss. Although it can be difficult to predict the appropriate scrubbing frequency a priori, a system that uses scrubbing can monitor the rate at which noncorrectable read errors are found and corrected and use the measured rate to adjust the scrubbing frequency.
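
One way such a feedback loop might look is sketched below. The policy is purely an assumption made for illustration: scrub twice as often when a pass finds more errors than a target, back off gradually when it finds far fewer, and keep the interval within fixed bounds. The function name, thresholds, and bounds are hypothetical and do not describe any particular system.

```python
def next_scrub_interval(current_hours, errors_found, target_errors,
                        min_hours=24.0, max_hours=24.0 * 30):
    """Pick the next scrub interval from the errors found by the last scrub pass.

    Assumed policy: halve the interval when errors exceed the target, lengthen
    it by 50% when errors are below half the target, otherwise leave it alone.
    """
    if errors_found > target_errors:
        proposed = current_hours / 2      # errors accumulating: scrub more often
    elif errors_found < target_errors / 2:
        proposed = current_hours * 1.5    # few errors: scrubbing can be relaxed
    else:
        proposed = current_hours          # close to target: no change
    # Clamp so the interval never becomes unreasonably short or long.
    return max(min_hours, min(max_hours, proposed))


if __name__ == "__main__":
    interval = 168.0  # hypothetical starting point: one scrub pass per week
    for errors in [10, 6, 3, 1, 0]:
        interval = next_scrub_interval(interval, errors, target_errors=4)
        print(f"found {errors:2d} errors -> next pass in {interval:.0f} hours")
```

Shortening the interval faster than it is lengthened is a conservative choice here: reacting slowly to a rising error rate risks data loss, while reacting slowly to a falling one only costs some extra disk bandwidth.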

Not having a backup. The techniques discussed in this section can protect a system against many, but not all, faults. For example, a widespread correlated failure (e.g., a building burning down), an operator error (e.g., "rm -r *"), or a software bug could corrupt or delete data stored across any number of redundant devices.

Definition: backup

A backup system provides storage that is separate from a system's main storage. Ideally, the separation is both physical and logical.

Definition: physical separation

Physical separation means that backup storage devices are in different locations than the primary storage devices. For example, some systems achieve physical separation by copying data to tape and storing the tapes in a different building than the main storage servers.

Other systems
