rates, and (3) reduce mean time to repair. All of these approaches, in various
combinations, are used in practice.
Here are some common approaches:
Increasing redundancy with more redundant disks. Rather than having
a single redundant block per group (e.g., using two mirrored disks or using one
parity disk for each stripe), systems can use double redundancy (e.g., three disk
replicas or two error correction disks for each stripe). In some cases, systems
may use even more redundancy. For example, the Google File System (GFS)
is designed to provide highly reliable and available storage across thousands of
disks; by default GFS stores each data block on three different disks.
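To make the triple-replication idea concrete, here is a minimal sketch of n-way
block replication in the spirit of the GFS default described above; the in-memory
"disks," the ReplicatedStore class, and its write/read methods are illustrative
stand-ins, not GFS's actual interfaces.

class ReplicatedStore:
    def __init__(self, disks, copies=3):
        self.disks = disks          # list of dicts standing in for disks
        self.copies = copies

    def write(self, block_id, data):
        # Place the block on `copies` distinct disks, spread by a simple hash.
        start = hash(block_id)
        for i in range(self.copies):
            self.disks[(start + i) % len(self.disks)][block_id] = data

    def read(self, block_id):
        # Any surviving replica can serve the read.
        for disk in self.disks:
            if block_id in disk:
                return disk[block_id]
        raise IOError("all replicas of %s lost" % block_id)

disks = [dict() for _ in range(6)]
store = ReplicatedStore(disks, copies=3)
store.write("blk-42", b"important data")
disks[0].clear()      # simulate two full-disk failures...
disks[1].clear()
assert store.read("blk-42") == b"important data"   # ...a third copy still survives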
A dual redundancy array is sometimes called RAID 6. To ensure that data
can be reconstructed despite any two failures in a stripe, error blocks are gen-
erated using erasure codes such as Reed-Solomon codes.
Definition: dual redundancy array
Definition: RAID 6
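As an illustration of how two error blocks let a stripe survive any two failures,
here is a minimal sketch of the common P+Q construction over GF(2^8): the P
block is plain XOR parity and the Q block is a Reed-Solomon-style weighted sum.
The generator g = 2, the reduction polynomial 0x11D, the helper names, and the
example stripe values are conventional or made-up choices, not taken from the text.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_pow(a: int, n: int) -> int:
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def gf_inv(a: int) -> int:
    # a^254 = a^-1 because the multiplicative group of GF(2^8) has order 255.
    return gf_pow(a, 254)

def pq_parity(data):
    """Compute the P (XOR) and Q (weighted) parity bytes, one byte per data disk."""
    p, q = 0, 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

def recover_two_data(data, x, y, p, q):
    """Rebuild data disks x and y (x < y) from the survivors plus P and Q."""
    pxy = qxy = 0
    for i, d in enumerate(data):
        if i in (x, y):
            continue
        pxy ^= d
        qxy ^= gf_mul(gf_pow(2, i), d)
    gyx = gf_pow(2, y - x)
    denom_inv = gf_inv(gyx ^ 1)
    a = gf_mul(gyx, denom_inv)
    b = gf_mul(gf_inv(gf_pow(2, x)), denom_inv)
    dx = gf_mul(a, p ^ pxy) ^ gf_mul(b, q ^ qxy)
    dy = (p ^ pxy) ^ dx
    return dx, dy

# One byte from each of six data disks in a stripe (made-up values).
stripe = [0x37, 0xA2, 0x00, 0x5C, 0xF1, 0x19]
p, q = pq_parity(stripe)
dx, dy = recover_two_data(stripe, 1, 4, p, q)   # pretend disks 1 and 4 failed
assert (dx, dy) == (stripe[1], stripe[4])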
A system with dual redundancy can be much more reliable than a simple
single redundancy RAID. With dual redundancy, the most likely data loss sce-
narios are (a) three full-disk failures or (b) a double-disk failure combined with
one or more nonrecoverable read errors.
If we optimistically assume that failures are independent and occur at a
constant rate, a system with two redundant disks per stripe has a potentially
low combined data loss rate:
FailureRate_dual+indep+const = (N / MTTF) × (MTTR · (G − 1) / MTTF) × (MTTR · (G − 2) / MTTF + P_fail_recovery_read)
This data loss rate is nearly MTTF / (MTTR · (G − 1)) times better than the
single-parity data loss rate; for disks with MTTFs of over one million hours,
MTTRs of under 10 hours, and group sizes of ten or fewer disks, double parity
improves the estimated rate by about a factor of 10,000.
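A quick numeric check of this claim, as a sketch: the MTTF, MTTR, and group
size below are the figures quoted above, while the system size N and the
nonrecoverable-read probability P_fail_recovery_read are illustrative assumptions.

MTTF = 1_000_000.0           # hours per disk (over one million hours)
MTTR = 10.0                  # hours to rebuild a failed disk (under 10 hours)
G = 10                       # disks per group/stripe (ten or fewer)
N = 100                      # total disks in the system (assumed)
P_fail_recovery_read = 0.01  # nonrecoverable read error during rebuild (assumed)

# Dual-redundancy data loss rate under the independent, constant-rate assumptions.
rate_dual = (N / MTTF) * (MTTR * (G - 1) / MTTF) * (MTTR * (G - 2) / MTTF
                                                    + P_fail_recovery_read)

# Improvement over single parity is roughly MTTF / (MTTR * (G - 1)).
improvement = MTTF / (MTTR * (G - 1))

print("dual-redundancy data loss rate ~ %.3g per hour" % rate_dual)
print("improvement over single parity ~ %.0f x" % improvement)  # ~11,000, about 10^4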
We emphasize, however, that the above equation almost certainly underes-
timates the likely data loss rate for real systems, which may suffer correlated
failures, varying failure rates, higher failure rates than advertised, and so on.
Reducing nonrecoverable read error rates with scrubbing. A storage
device's sector-level error rates are typically expressed as a single nonrecoverable
read rate, suggesting that the rate is constant. The reality is more complex.
Depending on the device, errors may accumulate over time, and heavier work-
loads may increase the rate at which errors accumulate.
An important technique for reducing a disk's nonrecoverable read rate is
scrubbing: periodically reading the entire contents of a disk, detecting sectors
Definition: scrubbing
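A minimal sketch of what a scrubbing pass might look like, assuming a disk
object with read_sector/write_sector methods and a reconstruct callback that
rebuilds a bad sector from the array's redundancy (mirror or parity); all of
these interfaces are illustrative, not a real disk API.

import time

def scrub(disk, num_sectors, reconstruct, pause_s=0.0):
    """Read every sector; repair any nonrecoverable read errors from redundancy."""
    repaired = 0
    for sector in range(num_sectors):
        try:
            disk.read_sector(sector)
        except IOError:
            # A latent sector error found before it can coincide with a disk
            # failure: rebuild the lost data and rewrite it, which lets the
            # drive remap the bad sector.
            disk.write_sector(sector, reconstruct(sector))
            repaired += 1
        if pause_s:
            time.sleep(pause_s)   # throttle so scrubbing stays a background task
    return repaired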
 