Reliable Storage - Operating Systems: Principles and Practice


If a disk suffers a whole-disk failure, an operator replaces the failed disk, and

the RAID system reconstructs all of the disk's data from the other disk(s) and

rewrites the data to the replacement disk. The average time from when a disk

fails until it has been replaced and rewritten is called the mean time to repair (MTTR).

RAID reliability

A RAID with one redundant disk per group (e.g., mirroring or rotating parity

RAIDs) can lose data in three ways: two full-disk failures, a full-disk failure

and one or more sector failures on other disks, and overlapping sector failures

on multiple disks. The expected time until one of these events occurs is called

the mean time to data loss (MTTDL).

Two full-disk failures.

If two disks in the same group fail, the system will be unable to reconstruct the missing data.

To get a sense of how serious a problem this might be, suppose that a

system has N disks with one parity block per G blocks, and suppose that disks

fail independently with a mean time to failure of MTTF and a mean time to

replace a failed disk and recover its data of MTTR.

Then, when the system is operating properly, the expected time until the first failure is $\text{MTTF}/N$. Assuming $\text{MTTR} \ll \text{MTTF}$, there is essentially a race to replace the disk and reconstruct its data before a second disk fails. We lose this race and hit the second failure before the repair is done with probability $\text{MTTR} / (\text{MTTF}/(G-1))$, giving us a mean time to data loss from multiple full-disk failures of

$$
\text{MTTDL}_{\text{two-full-disk}} = \frac{\text{MTTF}^2}{N\,(G-1)\,\text{MTTR}}
$$
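The two-full-disk estimate can be sketched in Python as follows. The function name `mttdl_two_full_disk` and its parameter names are illustrative assumptions rather than anything defined in the text, and the result is only an approximation that relies on independent, constant-rate failures and MTTR much smaller than MTTF.

```python
def mttdl_two_full_disk(n_disks, group_size, mttf_hours, mttr_hours):
    """Approximate mean time to data loss (hours) from two full-disk failures.

    Hypothetical helper: assumes independent failures at a constant rate
    and a repair time much shorter than the per-disk MTTF.
    """
    if group_size < 2 or n_disks < group_size:
        raise ValueError("need group_size >= 2 and n_disks >= group_size")
    if mttf_hours <= 0 or mttr_hours <= 0:
        raise ValueError("MTTF and MTTR must be positive")

    # With N disks failing independently, some disk fails about every
    # MTTF/N hours on average.
    time_to_first_failure = mttf_hours / n_disks

    # The other G-1 disks in the group together fail about once per
    # MTTF/(G-1) hours, so the chance that one of them fails during the
    # repair window is roughly MTTR / (MTTF/(G-1)).
    p_second_failure = mttr_hours / (mttf_hours / (group_size - 1))

    # On average 1/p first failures occur before one of them is followed
    # by a second failure in the same group during repair.
    return time_to_first_failure / p_second_failure
```

Because MTTR appears only in the denominator, halving the repair time doubles the estimate, which is one reason fast rebuilds matter for large arrays.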

Example: Mean time to double-disk failure.

Question: Suppose you have 100 disks organized into groups of 10, with

one disk storing a parity block per nine disks storing data blocks.

Assuming that disk failures are independent and the per-disk

mean time to failure is $10^6$ hours and assuming that the mean

time to repair a failed disk is 10 hours, estimate the expected

mean time to data loss due to a double-disk failure.

Answer: Because failures are assumed to occur independently and at a

constant rate, we can use the equation above:
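As a sketch of that substitution, taking N = 100 disks, G = 10 disks per group, MTTF = $10^6$ hours, and MTTR = 10 hours from the question:

$$
\text{MTTDL}_{\text{two-full-disk}} = \frac{\text{MTTF}^2}{N\,(G-1)\,\text{MTTR}} = \frac{(10^6)^2}{100 \cdot 9 \cdot 10}\ \text{hours} = \frac{10^{12}}{9000}\ \text{hours} \approx 1.1 \times 10^8\ \text{hours}
$$

At 8,760 hours per year this is on the order of 12,700 years, under the independence and constant-rate assumptions stated in the question.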
