If a disk suffers a whole-disk failure, an operator replaces the failed disk, and
the RAID system reconstructs all of the disk's data from the other disk(s) and
rewrites the data to the replacement disk. The average time from when a disk
fails until it has been replaced and rewritten is called the mean time to repair
(MTTR).
Definition: mean time to repair
Definition: MTTR
RAID reliability
A RAID with one redundant disk per group (e.g., mirroring or rotating parity
RAIDs) can lose data in three ways: two full-disk failures, a full-disk failure
and one or more sector failures on other disks, and overlapping sector failures
on multiple disks. The expected time until one of these events occurs is called
the mean time to data loss (MTTDL).
Definition: mean time to data loss
Definition: MTTDL
Two full-disk failures.
If two disks in the same group fail before the first has been repaired, the
system will be unable to reconstruct the missing data.
To get a sense of how serious a problem this might be, suppose that a
system has N disks with one parity block per G blocks, and suppose that disks
fail independently with a mean time to failure of MTTF and a mean time to
replace a failed disk and recover its data of MTTR.
Then, when the system is operating properly, the expected time until the
first failure is MTTF/N. Assuming MTTR << MTTF, there is essentially a
race to replace the disk and reconstruct its data before a second disk fails. We
lose this race and hit the second failure before the repair is done with probability
MTTR / (MTTF/(G - 1)), giving us a mean time to data loss from multiple
full-disk failures of

$$\mathrm{MTTDL}_{\text{two-full-disk}} = \frac{\mathrm{MTTF}^2}{N\,(G-1)\,\mathrm{MTTR}}$$
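The intermediate step can be written out explicitly. This is a sketch under the same assumptions as above (independent failures at a constant rate, MTTR << MTTF): while one disk is being repaired, the other G - 1 disks in its group fail at a combined rate of (G - 1)/MTTF, and the mean time to data loss is the expected time to a first failure divided by the probability that the repair race is lost. The symbol P_loss is introduced here only for illustration.

```latex
\begin{align*}
P_{\text{loss}} &\approx \frac{\mathrm{MTTR}}{\mathrm{MTTF}/(G-1)}
                 = \frac{(G-1)\,\mathrm{MTTR}}{\mathrm{MTTF}} \\[4pt]
\mathrm{MTTDL}_{\text{two-full-disk}}
                &\approx \frac{\mathrm{MTTF}/N}{P_{\text{loss}}}
                 = \frac{\mathrm{MTTF}^2}{N\,(G-1)\,\mathrm{MTTR}}
\end{align*}
```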
Example: Mean time to double-disk failure.
Question: Suppose you have 100 disks organized into groups of 10, with
one disk storing a parity block per nine disks storing data blocks.
Assuming that disk failures are independent and the per-disk
mean time to failure is 10^6 hours and assuming that the mean
time to repair a failed disk is 10 hours, estimate the expected
mean time to data loss due to a double-disk failure.
Answer: Because failures are assumed to occur independently and at a
constant rate, we can use the equation above:
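As a sketch of the arithmetic: with N = 100, G = 10, MTTF = 10^6 hours, and MTTR = 10 hours, the equation gives 10^12 / (100 × 9 × 10) ≈ 1.1 × 10^8 hours, which is on the order of 12,700 years. The short Python snippet below reproduces this calculation. The function and variable names are illustrative and do not come from the text, and a year is assumed to be 365 days.

```python
def mttdl_two_full_disk(mttf_hours, mttr_hours, n_disks, group_size):
    """Estimate mean time to data loss from two full-disk failures.

    Assumes independent failures at a constant rate and MTTR << MTTF,
    with one redundant disk per group of `group_size` disks.
    """
    # Expected time until any one of the N disks fails.
    time_to_first_failure = mttf_hours / n_disks
    # Probability that one of the other G - 1 disks in the same group
    # fails before the repair of the first disk completes.
    p_second_failure_during_repair = mttr_hours * (group_size - 1) / mttf_hours
    return time_to_first_failure / p_second_failure_during_repair


HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year

# Values from the example: 100 disks, groups of 10, MTTF 10^6 h, MTTR 10 h.
mttdl_hours = mttdl_two_full_disk(
    mttf_hours=1e6, mttr_hours=10, n_disks=100, group_size=10
)
mttdl_years = mttdl_hours / HOURS_PER_YEAR
print(f"MTTDL: {mttdl_hours:.3g} hours (about {mttdl_years:,.0f} years)")
```

Changing `mttr_hours` or `group_size` shows how strongly the estimate depends on repair time and group width, since both appear in the denominator.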
 