process pairs [31] mirroring of all computation is such a scheme that would halt
the degradation in utilization at 50% [28].
PDSI interest in large-scale cluster node failure originated in the key role of
high-bandwidth storage in checkpoint/restart strategies for application fault
tolerance [32]. Although storage failures are often masked from interrupting
applications by RAID technology [4], reconstructing a failed disk can impact
storage performance noticeably [33]. If too many failures occur, storage system
recovery tools can take days to bring a large file system back online, perhaps
without all of its users' precious data. Moreover, disks have traditionally been
viewed as perhaps the least reliable hardware component, due to the mechanical
aspects of a disk. The datasets obtained describe disk drive failures occurring
at HPC sites and at a large Internet service provider. The datasets vary in
duration from one month to five years; cover more than 100,000 hard drives
from four different vendors; and include SCSI, fibre channel, and SATA disk
drives. For more detailed results see Reference 34.
For modern drives, the datasheet MTTFs (mean times to failure) are typically
in the range of 1-1.5 million hours, suggesting an annual failure and
replacement rate (ARR) between 0.58% and 0.88% (8,760 hours per year divided
by the MTTF). In the data, however,
field experience with disk replacements differs from datasheet specifications
of disk reliability. Figure 2.10 shows the annual failure rate suggested by the
datasheets (horizontal solid and dashed lines), the observed ARRs for each of
the datasets, and the weighted average ARR for all disks less than five years
old (dotted line). The figure shows a significant discrepancy between the
observed ARR and the datasheet value for all datasets, with the former as high
as 13.5%. That is, the observed ARRs are up to a factor of 15 higher than datasheets
would indicate. The average observed ARR over all datasets (weighted by the
number of drives in each dataset) is 3.01%. Even after removing all COM3
[Figure 2.10 plot. Legend: Avrg. ARR, ARR_0.88, ARR_0.58. Datasets shown: HPC1, HPC2, HPC3, HPC4, COM1, COM2, COM3.]
Figure 2.10 Comparison of datasheet annual failure rates (horizontal dotted
lines) and the observed annual replacement rates of disks in the field.
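
The arithmetic behind these rates can be sketched in a few lines of Python. This is only an illustration, assuming a year of 8,760 hours and the simple conversion of hours per year divided by MTTF. The function names and the two-dataset example are hypothetical and are not taken from the study; the per-dataset drive counts behind the 3.01% average are not given in the text, so the weighted-average example uses made-up numbers.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored (assumption)


def datasheet_afr_percent(mttf_hours: float) -> float:
    """Nominal annual failure rate (%) implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours * 100.0


def weighted_arr_percent(datasets: list[tuple[int, float]]) -> float:
    """Average ARR (%) over datasets, weighted by drive count.

    Each entry is (number_of_drives, observed_arr_percent).
    """
    total_drives = sum(n for n, _ in datasets)
    return sum(n * arr for n, arr in datasets) / total_drives


# Datasheet range quoted in the text: MTTF of 1 to 1.5 million hours.
afr_low = datasheet_afr_percent(1_500_000)   # about 0.58%
afr_high = datasheet_afr_percent(1_000_000)  # about 0.88%

# Highest observed ARR quoted in the text, relative to the datasheet bound.
ratio = 13.5 / afr_high  # roughly 15

# Hypothetical drive counts and ARRs, for illustration only.
example_avg = weighted_arr_percent([(1000, 2.0), (3000, 4.0)])  # 3.5

print(f"datasheet AFR range: {afr_low:.2f}% to {afr_high:.2f}%")
print(f"13.5% observed vs. datasheet bound: factor of {ratio:.1f}")
print(f"weighted average of the made-up example: {example_avg:.2f}%")
```

Weighting by drive count means that a small dataset with an unusually high replacement rate moves the average less than a large dataset does.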