Modeling Real Systems
The equations in the main text for estimating a system's mean time to data loss
are only applicable if failure rates are constant and failures are uncorrelated.
Unfortunately, empirical studies often observe correlation among full-disk failures, among
sector-level failures, and between sector-level and full-disk failures, and they frequently
find failure rates that vary significantly with disks' ages. If failure rates
vary over time or failures are correlated, the failure arrival distribution is no longer
described by an exponential distribution, and the math quickly gets difficult.
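To see why age-dependent failure rates break the constant-rate assumption, the following minimal sketch compares an exponential model with a Weibull model. The Weibull distribution is only an illustrative stand-in for "failure rate varies with age"; it and all of the function names and parameter values here are assumptions, not taken from the main text.

```python
import math


def exponential_survival(rate_per_hour, hours):
    """P(no failure by `hours`) when the failure rate is constant."""
    return math.exp(-rate_per_hour * hours)


def weibull_survival(shape, scale_hours, hours):
    """P(no failure by `hours`) under a Weibull model.

    shape == 1 reduces to the exponential case; shape > 1 means the
    failure rate grows with age (an assumed, illustrative choice).
    """
    return math.exp(-((hours / scale_hours) ** shape))


def conditional_survival(survival, age_hours, extra_hours):
    """P(surviving extra_hours more | disk already survived to age_hours)."""
    return survival(age_hours + extra_hours) / survival(age_hours)


# Hypothetical parameters, chosen only to make the contrast visible.
def constant_rate(t):
    return exponential_survival(1e-5, t)


def aging(t):
    return weibull_survival(2.0, 1e5, t)


for age in (0, 50_000):
    print(age,
          conditional_survival(constant_rate, age, 1_000),
          conditional_survival(aging, age, 1_000))
```

Under the constant-rate model, a disk's chance of surviving the next 1,000 hours is the same at any age, which is what makes the closed-form equations tractable. Under the aging model that chance drops as the disk gets older, so the simple equations no longer apply.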
One solution is to use randomized simulation to estimate the probability of data loss
over some duration of interest. For example, we might want to estimate the probability
of losing data over 10 years for a 1000-disk system organized in groups of 10 disks with
rotating parity.
To do this, our simulation would track which disks are functioning normally, which
have latent sector errors, and which have suffered full-disk failures. The transitions
between states could be based on measurement studies or field data on key factors like
(a) the rate at which disks suffer full-disk failures (possibly dependent on the disks' ages, the
number of recent full-disk failures, or the number of individual sector failures a disk has
had), (b) the rate at which sector failures arise (possibly dependent on the age of the
disk, the workload of the disk, and the recent frequency of sector failures), (c) the repair time
when a disk fails, and (d) the expected time for scrubbing to detect and repair a sector
error.
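One way to represent the per-disk state and the four inputs (a) through (d) is sketched below. The class names, field names, and every numeric value are hypothetical placeholders chosen for illustration; a real model would take its rates from measurement studies or field data.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class DiskState(Enum):
    OK = "functioning normally"
    LATENT_ERROR = "has a latent sector error"
    FAILED = "full-disk failure, awaiting repair"


@dataclass
class Disk:
    state: DiskState = DiskState.OK
    age_hours: float = 0.0
    sector_errors_seen: int = 0


@dataclass
class TransitionModel:
    # (a) full-disk failure rate per hour, as a function of the disk's history
    disk_failure_rate: Callable[[Disk], float]
    # (b) sector-error arrival rate per hour, as a function of the disk's history
    sector_error_rate: Callable[[Disk], float]
    # (c) mean hours to replace and rebuild a failed disk
    mean_repair_hours: float
    # (d) mean hours for scrubbing to detect and repair a sector error
    mean_scrub_hours: float


def event_probability(rate_per_hour: float, dt_hours: float) -> float:
    """P(at least one event in dt), treating the rate as fixed over dt."""
    return 1.0 - math.exp(-rate_per_hour * dt_hours)


# Placeholder numbers, not measurements: rates rise with age, and prior
# sector errors make a full-disk failure more likely.
example_model = TransitionModel(
    disk_failure_rate=lambda d: 1e-6 * (1 + d.age_hours / 40_000)
    * (1 + d.sector_errors_seen),
    sector_error_rate=lambda d: 2e-6 * (1 + d.age_hours / 40_000),
    mean_repair_hours=24.0,
    mean_scrub_hours=168.0,
)
```

Passing the rates as functions of the disk lets the same simulation loop work with constant rates, age-dependent rates, or rates that depend on a disk's error history.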
To estimate the probability of data loss, we would repeatedly simulate the system for
a decade and count the number of runs in which the system enters a state where data is lost
(i.e., a group has two full-disk failures, or has both a full-disk failure and a sector failure
on another disk). The fraction of runs that lose data is our estimate of that probability.
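The repeated-trial procedure might be sketched as follows. To keep the example short, it assumes constant, independent rates and simulates each group separately with an event-driven loop. That is exactly the case the closed-form equations already handle, so the sketch shows only the mechanics; the value of simulation comes from replacing the fixed rates with age-dependent or correlated ones. All names and parameter values are hypothetical.

```python
import random

OK, LATENT, FAILED = 0, 1, 2


def data_lost(group):
    """Two full-disk failures, or one failure plus a sector error elsewhere."""
    failed = group.count(FAILED)
    latent = group.count(LATENT)
    return failed >= 2 or (failed >= 1 and latent >= 1)


def simulate_group(rng, n_disks, horizon_hours,
                   fail_rate, sector_rate, repair_hours, scrub_hours):
    """Return True if one parity group loses data before horizon_hours."""
    group = [OK] * n_disks
    now = 0.0
    while True:
        # Possible next events as (rate per hour, disk index, new state).
        events = []
        for i, state in enumerate(group):
            if state == FAILED:
                events.append((1.0 / repair_hours, i, OK))
            else:
                events.append((fail_rate, i, FAILED))
                if state == OK:
                    events.append((sector_rate, i, LATENT))
                else:
                    events.append((1.0 / scrub_hours, i, OK))
        weights = [rate for rate, _, _ in events]
        total = sum(weights)
        if total <= 0.0:
            return False  # nothing can ever happen
        now += rng.expovariate(total)  # time until the next event
        if now > horizon_hours:
            return False
        _, i, new_state = rng.choices(events, weights=weights)[0]
        group[i] = new_state
        if data_lost(group):
            return True


def estimate_loss_probability(trials, n_groups=100, n_disks=10,
                              years=10.0, seed=0, **rates):
    """Fraction of simulated runs in which any group loses data."""
    rng = random.Random(seed)
    horizon = years * 365 * 24
    losses = 0
    for _ in range(trials):
        if any(simulate_group(rng, n_disks, horizon, **rates)
               for _ in range(n_groups)):
            losses += 1
    return losses / trials


# Placeholder rates (per hour) and times (hours), not measured values.
p = estimate_loss_probability(trials=200, fail_rate=1e-6, sector_rate=2e-6,
                              repair_hours=24.0, scrub_hours=168.0)
print(f"estimated probability of data loss over 10 years: {p:.3f}")
```

The defaults match the example of 1000 disks in 100 groups of 10 simulated for a decade. More trials give a tighter estimate at the cost of longer run time.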
[Figure: a 4096-byte data block, a 64-byte data integrity segment (DIS), and 448 unused bytes, stored in nine 512-byte disk sectors.]
Figure 14.7: To improve reliability, Network Appliance's WAFL file system
stores a 64-byte data integrity segment (DIS) with each 4 KB data block.
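The sizes in the figure can be checked with a little arithmetic. The sketch below assumes, as the figure suggests, that each block and its DIS are stored together in whole 512-byte sectors; the constant and function names are illustrative.

```python
SECTOR_BYTES = 512
DATA_BLOCK_BYTES = 4096  # one 4 KB data block
DIS_BYTES = 64           # data integrity segment stored with the block


def sectors_needed(payload_bytes, sector_bytes=SECTOR_BYTES):
    """Whole sectors required to hold payload_bytes (ceiling division)."""
    return -(-payload_bytes // sector_bytes)


payload = DATA_BLOCK_BYTES + DIS_BYTES          # bytes actually used
sectors = sectors_needed(payload)               # sectors occupied on disk
unused = sectors * SECTOR_BYTES - payload       # padding in the last sector

print(sectors, unused)
```

The data block alone fills eight sectors exactly, so adding the 64-byte DIS costs a ninth sector, most of which is left unused.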