more commonplace, so architects must design systems to cope with these challenges. This section gives a quick overview of the issues in dependability, leaving the official definition of the terms and approaches to Section D.3 in Appendix D.
Computers are designed and constructed at different layers of abstraction. We can descend recursively down through a computer, seeing components enlarge themselves to full subsystems until we run into individual transistors. Although some faults are widespread, like the loss of power, many can be limited to a single component in a module. Thus, utter failure of a module at one level may be considered merely a component error in a higher-level module. This distinction is helpful in trying to find ways to build dependable computers.
One difficult question is deciding when a system is operating properly. This philosophical point became concrete with the popularity of Internet services. Infrastructure providers started offering service level agreements (SLAs) or service level objectives (SLOs) to guarantee that their networking or power service would be dependable. For example, they would pay the customer a penalty if they failed to meet the agreement for more than some number of hours per month. Thus, an SLA could be used to decide whether the system was up or down.
Systems alternate between two states of service with respect to an SLA:
1. Service accomplishment, where the service is delivered as specified
2. Service interruption, where the delivered service is different from the SLA
Transitions between these two states are caused by failures (from state 1 to state 2) or restorations (2 to 1). Quantifying these transitions leads to the two main measures of dependability:
Module reliability is a measure of the continuous service accomplishment (or, equivalently, of the time to failure) from a reference initial instant. Hence, the mean time to failure (MTTF) is a reliability measure. The reciprocal of MTTF is a rate of failures, generally reported as failures per billion hours of operation, or FIT (for failures in time). Thus, an MTTF of 1,000,000 hours equals 10^9/10^6 or 1000 FIT. Service interruption is measured as mean time to repair (MTTR). Mean time between failures (MTBF) is simply the sum of MTTF + MTTR. Although MTBF is widely used, MTTF is often the more appropriate term. If a collection of modules has exponentially distributed lifetimes—meaning that the age of a module is not important in probability of failure—the overall failure rate of the collection is the sum of the failure rates of the modules.
Module availability is a measure of the service accomplishment with respect to the alternation between the two states of accomplishment and interruption. For nonredundant systems with repair, module availability is

Module availability = MTTF / (MTTF + MTTR)
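To make these two measures concrete, here is a small Python sketch (not from the text) that converts an MTTF into a FIT rate and computes availability for a nonredundant module with repair; the 24-hour MTTR is a hypothetical value chosen only for illustration.

BILLION_HOURS = 1e9  # FIT is defined as failures per 10^9 hours of operation

def mttf_to_fit(mttf_hours: float) -> float:
    """Failure rate in FIT for a module with the given MTTF."""
    return BILLION_HOURS / mttf_hours

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Module availability = MTTF / (MTTF + MTTR) for a nonredundant system with repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

# The conversion above: an MTTF of 1,000,000 hours is 10^9 / 10^6 = 1000 FIT.
print(mttf_to_fit(1_000_000))        # 1000.0

# Hypothetical MTTR of 24 hours (an assumption, not from the text).
print(availability(1_000_000, 24))   # ~0.999976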
Note that reliability and availability are now quantifiable metrics, rather than synonyms for dependability. From these definitions, we can estimate the reliability of a system quantitatively if we make some assumptions about the reliability of components and that failures are independent.
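As one way to make such an estimate concrete, the following Python sketch computes a system MTTF from hypothetical component MTTFs, using the assumptions just stated: exponentially distributed lifetimes and independent failures, so the system failure rate is the sum of the component failure rates. The component names and values are illustrative placeholders, not figures from the text.

# Hypothetical component MTTFs in hours (illustrative values only).
component_mttf_hours = {
    "disk": 1_000_000,
    "controller": 500_000,
    "power_supply": 200_000,
}

# Assuming exponential lifetimes and independent failures, the system
# failure rate is simply the sum of the component failure rates.
system_failure_rate = sum(1.0 / mttf for mttf in component_mttf_hours.values())

system_mttf = 1.0 / system_failure_rate    # 125,000 hours for these values
system_fit = system_failure_rate * 1e9     # 8,000 FIT for these values

print(f"System MTTF: {system_mttf:,.0f} hours")
print(f"System failure rate: {system_fit:,.0f} FIT")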