raised. Unfortunately, there was no exception-handling mechanism for this particular
exception, so the onboard computers crashed.
The faulty piece of code had been part of the software for the Ariane 4. The 64-
bit floating-point value represented the horizontal bias of the launch vehicle, which is
related to its horizontal velocity. When the software module was designed, engineers
determined that it would be impossible for the horizontal bias to be so large that it could
not be stored in a 16-bit signed integer. There was no need for an error handler, because
an error could not occur. This code was moved “as is” into the software for the Ariane
5. That proved to be an extremely costly mistake, because the Ariane 5 was faster than
the Ariane 4. The original assumptions made by the designers of the software no longer
held true [26].
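The arithmetic failure is easy to reproduce in miniature. The sketch below is illustrative only: it is Python rather than the actual flight code, the convert_bias function and sample values are invented, and Python's struct.error stands in for the exception the flight software raised.

```python
import struct

def convert_bias(bias: float) -> int:
    """Convert a 64-bit floating-point horizontal bias to a 16-bit
    signed integer. struct.pack(">h", ...) rejects any value outside
    [-32768, 32767], playing the role of the unhandled exception."""
    packed = struct.pack(">h", int(bias))
    return struct.unpack(">h", packed)[0]

# On Ariane 4 trajectories the bias always fit, so the designers saw
# no need for an error handler:
print(convert_bias(12_345.6))   # 12345

# The faster Ariane 5 produced larger values, and the conversion fails:
print(convert_bias(65_535.0))   # raises struct.error -- nothing catches it
```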
8.4.3 AT&T Long-Distance Network
On the afternoon of January 15, 1990, AT&T's long-distance network suffered a significant
disruption of service. About half of the computerized telephone-routing switches
crashed, and the remainder could not handle all of the traffic. As a result of this
failure, about 70 million long-distance telephone calls could not be put through,
and about 60,000 people lost all telephone service. AT&T lost tens of millions of
dollars in revenue. It also lost some of its credibility as a reliable provider of
long-distance service.
Investigation by AT&T engineers revealed that the network crash was brought about
by a single faulty line of code in an error-recovery procedure. The system was designed
so that if a switch discovered it was in an error state, it would reboot itself, a crude but
effective way of “wiping the slate clean.” After a switch rebooted itself, it would send an
“OK” message to other switches, letting them know it was back online. The software
bug manifested itself when a very busy switch received an “OK” message. Under certain
circumstances, handling the “OK” message would cause the busy switch to enter an error
state and reboot.
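The shape of the defect can be sketched in a few lines. The model below is purely illustrative: the Switch class, its BUSY_THRESHOLD, and the load counter are invented here, not taken from AT&T's System 7 implementation. What it captures is the essential flaw: the code path that handles a neighbor's recovery can itself trigger error recovery on a busy switch.

```python
class Switch:
    """Toy model of a telephone-routing switch (not AT&T's actual code)."""

    BUSY_THRESHOLD = 100  # hypothetical call load above which the handler fails

    def __init__(self, name: str):
        self.name = name
        self.load = 0     # long-distance calls currently being routed

    def reboot(self) -> str:
        # Crude but effective recovery: wipe the slate clean, then let
        # the other switches know this one is back online.
        self.load = 0
        return "OK"       # message broadcast to every other switch

    def receive_ok(self, sender: str):
        # The simulated bug: on a very busy switch, handling the "OK"
        # message corrupts internal state, the self-checks flag an error
        # condition, and the switch reboots -- broadcasting yet another
        # "OK" message and spreading the failure.
        if self.load > self.BUSY_THRESHOLD:
            return self.reboot()
        return None       # normal path: note that `sender` is reachable again

atlanta = Switch("Atlanta")
atlanta.load = 150                  # a very busy afternoon
print(atlanta.receive_ok("NYC"))    # prints "OK": the busy switch crashed too
```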
On the afternoon of January 15, 1990, a System 7 switch in New York City de-
tected an error condition and rebooted itself (Figure 8.3). When it came back online,
it broadcast an “OK” message. All the switches receiving the “OK” messages handled
them correctly, except three very busy switches in St. Louis, Detroit, and Atlanta. These
switches detected an error condition and rebooted. When they came back up, all of them
broadcast “OK” messages across the network, causing other switches to fail in an ever-
expanding wave.
Every switch failure compounded the problem in two ways. When the switch went
down, it pushed more long-distance traffic onto the other switches, making them busier.
When the switch came back up, it broadcast “OK” messages to these busier switches,
causing some of them to fail. Some switches rebooted repeatedly under the barrage of
“OK” messages. Within 10 minutes, half the switches in the AT&T network had failed.
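A crude simulation is enough to show how these two effects combine into an expanding wave. Every number below (switch count, loads, threshold) is invented for illustration; no claim is made about AT&T's real traffic model.

```python
import random

random.seed(1990)

NUM_SWITCHES = 100
BUSY_THRESHOLD = 60   # invented load above which the "OK" handler fails

# Each switch starts with a random call load; some are already very busy.
loads = [random.randint(20, 70) for _ in range(NUM_SWITCHES)]

broadcasters = [0]    # switch 0 reboots first and broadcasts "OK"
wave = []             # number of switches failing in each round

for _ in range(8):
    # Every busy switch that hears an "OK" message enters an error
    # state and reboots.
    rebooting = {r for r in range(NUM_SWITCHES)
                 if r not in broadcasters and loads[r] > BUSY_THRESHOLD}
    if not rebooting:
        break
    wave.append(len(rebooting))

    # First effect: a down switch sheds its traffic onto the survivors,
    # making them busier.
    shed = sum(loads[s] for s in rebooting) // (NUM_SWITCHES - len(rebooting))
    for r in range(NUM_SWITCHES):
        loads[r] = 10 if r in rebooting else loads[r] + shed

    # Second effect: the rebooted switches come back up and broadcast
    # "OK" to the now-busier survivors, seeding the next round.
    broadcasters = list(rebooting)

print(wave)   # the wave of failures, round by round
```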
The crash could have been much worse, but AT&T had converted only 80 of its
network switches to the System 7 software. It had left System 6 software running on 34
of the switches “just in case.” The System 6 switches did not have the software bug and
did not crash [27, 28].