If 1,000,000 drives of this model are in service and all 1,000,000 are running simultaneously, you can
expect one failure out of this entire population every half-hour. MTBF statistics are not useful for
predicting the failure of an individual drive or a small sample of drives.
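The population-level arithmetic can be sketched in a few lines. This is only an illustration of the reasoning above; the MTBF value used here is an assumption chosen to be consistent with the text's example (1,000,000 drives, one failure every half-hour implies a rated MTBF of 500,000 hours).

```python
# Sketch of the population-level MTBF arithmetic described in the text.
# The MTBF rating below is an assumed figure, chosen so the numbers
# match the example of one failure per half-hour across 1,000,000 drives.

mtbf_hours = 500_000          # assumed per-drive MTBF rating (hours)
drives_in_service = 1_000_000  # size of the running population

# With every drive powered on, the expected failure rate across the
# whole population is (number of drives) / (per-drive MTBF).
failures_per_hour = drives_in_service / mtbf_hours
hours_between_failures = 1 / failures_per_hour

print(failures_per_hour)       # 2.0 failures per hour across the population
print(hours_between_failures)  # 0.5 hours, i.e., one failure every half-hour
```

Note that nothing in this calculation says anything about *which* drive fails next, which is exactly why the figure is useless for predicting the lifespan of any single unit.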
You also need to understand the meaning of the word failure in this context: a failure is a fault that
requires the drive to be returned to the manufacturer for repair, not an occasional failure to read or
write a file correctly.
Finally, as some drive manufacturers point out, this measure of MTBF should really be called mean
time to first failure. “Between failures” implies that the drive fails, is returned for repair, and then at
some point fails again. The interval between repair and the second failure here would be the MTBF.
In most cases, a failed hard drive that would need manufacturer repair is replaced rather than
repaired, so the whole MTBF concept is misnamed.
The bottom line is that I do not really place much emphasis on MTBF figures. For an individual
drive, they are not accurate predictors of reliability. However, if you are an information systems
manager considering the purchase of thousands of PCs or drives per year or a system vendor building
and supporting thousands of systems, it might be worth your while to examine these numbers and
study the methods each vendor uses to calculate them. Most hard drive manufacturers designate
their premium drives as Enterprise class drives, meaning they are designed for environments
requiring continuous operation and high reliability, and these drives carry the highest MTBF ratings. If you can
understand the vendor's calculations and compare the actual reliability of a large sample of drives,
you can purchase more reliable drives and save time and money in service and support.
S.M.A.R.T.
Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) is an industry standard providing
failure prediction for disk drives. When S.M.A.R.T. is enabled for a given drive, the drive monitors
predetermined attributes that are susceptible to or indicative of drive degradation. Based on changes
in the monitored attributes, a failure prediction can be made. If a failure is deemed likely to occur,
S.M.A.R.T. makes a status report available so the system BIOS or driver software can notify the user
of the impending problems, perhaps enabling the user to back up the data on the drive before any real
problems occur.
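The mechanism just described boils down to comparing each monitored attribute against a vendor-set threshold. The following is a minimal sketch of that logic, not a real S.M.A.R.T. reader: the attribute names, values, and thresholds are illustrative, though the convention they follow is the real one (each attribute carries a normalized value that degrades over time, and a failure is predicted when it falls to or below its threshold).

```python
# Illustrative sketch of S.M.A.R.T.-style threshold checking.
# Attribute names and numbers here are made up for the example;
# the convention (normalized value degrades toward a vendor-set
# threshold) mirrors how S.M.A.R.T. attributes actually work.

attributes = {
    # name: (current normalized value, failure threshold)
    "reallocated_sector_count": (95, 36),
    "spin_up_time":             (70, 21),
    "seek_error_rate":          (30, 45),  # degraded past its threshold
}

def smart_status(attrs):
    """Return the list of attributes currently predicting failure."""
    return [name for name, (value, threshold) in attrs.items()
            if value <= threshold]

failing = smart_status(attributes)
print("FAILURE PREDICTED:" if failing else "OK", failing)
```

In a real system, the drive firmware performs this comparison internally and the BIOS or monitoring software merely queries the resulting pass/fail status, which is what allows the user to be warned before data is lost.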
Predictable failures are the types of failures S.M.A.R.T. attempts to detect; they result from
the gradual degradation of the drive's performance. According to Seagate, 60% of drive failures are
mechanical, which is exactly the type of failure S.M.A.R.T. is designed to predict.
Of course, not all failures are predictable, and S.M.A.R.T. can't help with unpredictable failures that
occur without advance warning. These can be caused by static electricity, improper handling or
sudden shock, or circuit failure (such as thermal-related solder problems or component failure).
S.M.A.R.T. was originally created by IBM in 1992. That year IBM began shipping 3 1/2-inch HDDs
equipped with Predictive Failure Analysis (PFA), an IBM-developed technology that periodically
measures selected drive attributes and sends a warning message when a predefined threshold is
exceeded. IBM turned this technology over to the American National Standards Institute (ANSI)
organization, and it subsequently became the ANSI-standard S.M.A.R.T. protocol for SCSI drives, as
defined in the ANSI-SCSI Informational Exception Control (IEC) document X3T10/94-190.
Interest in extending this technology to ATA drives led to the creation of the S.M.A.R.T. Working
Group in 1995. Besides IBM, other companies represented in the original group were Seagate
Technology, Conner Peripherals (now a part of Seagate), Fujitsu, Hewlett-Packard, Maxtor (now a