Failure - Building Software: A Practitioner's Guide - page 13


in specified environments and with a desired confidence, can be specified, designed in, predicted, tested and demonstrated.”

In material product engineering, reliability is extremely important. Manufacturers want to reduce their costs as much as possible, sometimes taking the route of using inferior-quality raw materials, or not adhering to approved production standards. This may not be the right strategy because it can increase the cost of supporting the product once it gets out in the field.

Although the above may have been used in the context of purely mechanical systems, it also has relevance to the software industry. Conscious efforts must be made to weave reliability through the life cycle of the development by incorporating reliability models in one's software right from the product planning stage. The architecture and design (scalability, performance, code reuse, third-party components, etc.) set the stage for the reliability of the product. Further along, coding styles, adherence to good software engineering practices, and adequate developer testing help in creating reliable code. Subsequent testing, using meaningful in-house data and field data (if the customer is using a system already), helps in ascertaining how reliably the software will work in a production environment. An analysis of bugs discovered in the software, which requires proper documentation of all bugs found in-house or at the customer site, irrespective of their severity (see Chapter 12 on quality), is another valuable tool. Reliability engineering is known to use various distribution models (Weibull being the most widely used) to get a good handle on the success rates of manufactured products, and ensure that they balance the business goals of the customer.
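The Weibull model mentioned above can be made concrete with a small sketch. The two-parameter form used here (a shape parameter and a scale parameter) is one common convention and is an assumption on our part, as are the function names and the sample values; the text does not prescribe a particular parameterization.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that a unit is still working at time t.

    Two-parameter Weibull survival function: R(t) = exp(-(t / scale) ** shape).
    The names `shape` and `scale` are illustrative, not taken from the text.
    """
    if t < 0 or shape <= 0 or scale <= 0:
        raise ValueError("t must be >= 0; shape and scale must be > 0")
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at time t.

    h(t) = (shape / scale) * (t / scale) ** (shape - 1), defined here for t > 0.
    """
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be > 0")
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    # Hypothetical parameters: time in operating hours, scale of 1000 hours.
    for hours in (100, 500, 1000):
        r = weibull_reliability(hours, shape=1.5, scale=1000)
        print(f"R({hours}) = {r:.3f}")
```

In this form a shape of 1 gives a constant failure rate (the exponential case), a shape above 1 gives a failure rate that rises with age, and a shape below 1 gives one that falls with age. At t equal to the scale parameter, the reliability is exp(-1), roughly 0.368, whatever the shape.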

Accountability for Failure

Systems can fail due to human error. Typical errors are often the result of incomplete information about how to operate the system, or of failure to follow a long set of procedures while using it. Failures can also occur when accountability has not been properly established in the operational hierarchy of the system. If users or operators are not fully aware of their roles and responsibilities (especially those who are monitoring the behavior of the system), they tend to assume that reporting a potential problem may not be their job. They assume that the other guy or the manager will ultimately spot it. Critical problems can easily be missed this way. The following is a good checklist to ensure that no part of the system is passed over:
