raised. Unfortunately, there was no exception-handling mechanism for this particular
exception, so the onboard computers crashed.
The faulty piece of code had been part of the software for the Ariane 4. The 64-
bit floating-point value represented the horizontal bias of the launch vehicle, which is
related to its horizontal velocity. When the software module was designed, engineers
determined that it would be impossible for the horizontal bias to be so large that it could
not be stored in a 16-bit signed integer. There was no need for an error handler, because
an error could not occur. This code was moved “as is” into the software for the Ariane
5. That proved to be an extremely costly mistake, because the Ariane 5 was faster than
the Ariane 4. The original assumptions made by the designers of the software no longer
held true [26].
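The arithmetic failure is easy to reproduce in miniature. The sketch below is illustrative only: it is Python rather than the actual flight code, the convert_bias function and sample values are invented, and Python's struct.error stands in for the exception the flight software raised.

```python
import struct

def convert_bias(bias: float) -> int:
    """Convert a 64-bit floating-point horizontal bias to a 16-bit
    signed integer. struct.pack(">h", ...) rejects any value outside
    [-32768, 32767], playing the role of the unhandled exception."""
    packed = struct.pack(">h", int(bias))
    return struct.unpack(">h", packed)[0]

# On Ariane 4 trajectories the bias always fit, so the designers saw
# no need for an error handler:
print(convert_bias(12_345.6))   # 12345

# The faster Ariane 5 produced larger values, and the conversion fails:
print(convert_bias(65_535.0))   # raises struct.error -- nothing catches it
```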
8.4.3 AT&T Long-Distance Network
On the afternoon of January 15, 1990, AT&T's long-distance network suffered a significant
disruption of service. About half of the computerized telephone-routing switches
crashed, and the remainder could not handle all of the traffic. As a result of this
failure, about 70 million long-distance telephone calls could not be put through,
and about 60,000 people lost all telephone service. AT&T lost tens of millions of
dollars in revenue. It also lost some of its credibility as a reliable provider of
long-distance service.
Investigation by AT&T engineers revealed that the network crash was brought about
by a single faulty line of code in an error-recovery procedure. The system was designed
so that if a switch discovered it was in an error state, it would reboot itself, a crude but
effective way of “wiping the slate clean.” After a switch rebooted itself, it would send an
“OK” message to other switches, letting them know it was back online. The software
bug manifested itself when a very busy switch received an “OK” message. Under certain
circumstances, handling the “OK” message would cause the busy switch to enter an error
state and reboot.
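The shape of the defect can be sketched in a few lines. The model below is purely illustrative: the Switch class, its BUSY_THRESHOLD, and the load counter are invented here, not taken from AT&T's System 7 implementation. What it captures is the essential flaw: the code path that handles a neighbor's recovery can itself trigger error recovery on a busy switch.

```python
class Switch:
    """Toy model of a telephone-routing switch (not AT&T's actual code)."""

    BUSY_THRESHOLD = 100  # hypothetical call load above which the handler fails

    def __init__(self, name: str):
        self.name = name
        self.load = 0     # long-distance calls currently being routed

    def reboot(self) -> str:
        # Crude but effective recovery: wipe the slate clean, then let
        # the other switches know this one is back online.
        self.load = 0
        return "OK"       # message broadcast to every other switch

    def receive_ok(self, sender: str):
        # The simulated bug: on a very busy switch, handling the "OK"
        # message corrupts internal state, the self-checks flag an error
        # condition, and the switch reboots -- broadcasting yet another
        # "OK" message and spreading the failure.
        if self.load > self.BUSY_THRESHOLD:
            return self.reboot()
        return None       # normal path: note that `sender` is reachable again

atlanta = Switch("Atlanta")
atlanta.load = 150                  # a very busy afternoon
print(atlanta.receive_ok("NYC"))    # prints "OK": the busy switch crashed too
```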
On the afternoon of January 15, 1990, a System 7 switch in New York City de-
tected an error condition and rebooted itself (Figure 8.3). When it came back online,
it broadcast an “OK” message. All the switches receiving the “OK” messages handled
them correctly, except three very busy switches in St. Louis, Detroit, and Atlanta. These
switches detected an error condition and rebooted. When they came back up, all of them
broadcast “OK” messages across the network, causing other switches to fail in an ever-
expanding wave.
Every switch failure compounded the problem in two ways. When the switch went
down, it pushed more long-distance traffic onto the other switches, making them busier.
When the switch came back up, it broadcast “OK” messages to these busier switches,
causing some of them to fail. Some switches rebooted repeatedly under the barrage of
“OK” messages. Within 10 minutes, half the switches in the AT&T network had failed.
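A crude simulation is enough to show how these two effects combine into an expanding wave. Every number below (switch count, loads, threshold) is invented for illustration; no claim is made about AT&T's real traffic model.

```python
import random

random.seed(1990)

NUM_SWITCHES = 100
BUSY_THRESHOLD = 60   # invented load above which the "OK" handler fails

# Each switch starts with a random call load; some are already very busy.
loads = [random.randint(20, 70) for _ in range(NUM_SWITCHES)]

broadcasters = [0]    # switch 0 reboots first and broadcasts "OK"
wave = []             # number of switches failing in each round

for _ in range(8):
    # Every busy switch that hears an "OK" message enters an error
    # state and reboots.
    rebooting = {r for r in range(NUM_SWITCHES)
                 if r not in broadcasters and loads[r] > BUSY_THRESHOLD}
    if not rebooting:
        break
    wave.append(len(rebooting))

    # First effect: a down switch sheds its traffic onto the survivors,
    # making them busier.
    shed = sum(loads[s] for s in rebooting) // (NUM_SWITCHES - len(rebooting))
    for r in range(NUM_SWITCHES):
        loads[r] = 10 if r in rebooting else loads[r] + shed

    # Second effect: the rebooted switches come back up and broadcast
    # "OK" to the now-busier survivors, seeding the next round.
    broadcasters = list(rebooting)

print(wave)   # the wave of failures, round by round
```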
The crash could have been much worse, but AT&T had converted only 80 of its
network switches to the System 7 software. It had left System 6 software running on 34
of the switches “just in case.” The System 6 switches did not have the software bug and
did not crash [27, 28].