Java Reference
In-Depth Information
The FAA first discovered the problem when it ran tests on the system in the
field. It ran for 49.7 days, and then crashed. After rebooting, everything seemed
fine. When a similar crash happened with another system, the FAA instituted the
30-day reboot procedure.
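The 49.7-day figure is the giveaway: a 32-bit counter incremented once per millisecond wraps around after 2^32 milliseconds. A minimal sketch of that arithmetic (the class and variable names here are illustrative, not taken from the actual Harris system):

```java
// Illustrates why a 32-bit millisecond uptime counter fails
// after roughly 49.7 days: 2^32 ms divided into days.
public class CounterOverflow {
    public static void main(String[] args) {
        long wrapMillis = 1L << 32;  // 2^32 = 4,294,967,296 ms before wraparound
        double days = wrapMillis / (1000.0 * 60 * 60 * 24);
        System.out.printf("32-bit ms counter wraps after %.1f days%n", days);
    }
}
```

Running it prints "32-bit ms counter wraps after 49.7 days" — exactly the uptime at which the system crashed in the field, and comfortably inside the FAA's 30-day reboot window.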
After the incident in Los Angeles, the issue was tracked down, and a software
patch was created to fix the problem. Now the system periodically resets the
counter without the need for human intervention.
Lessons Learned
In this situation, the problem (if not its implications) was known beforehand.
Harris (the manufacturer) knew about the potential for the timer to expire but
hadn't determined the impact it might have on the system. The FAA discovered
the problem during tests---although not the root cause. Instead of investigating further,
the agency instituted a manual workaround---the ultimate “when in doubt,
reboot” scenario.
It's true that the problem would have been avoided if the FAA procedures had
been followed, but that's of little comfort when the software can make such pro-
cedures unnecessary. It's also true that the incident would have been negligible if
the backup system had not failed. Having redundant backup systems would lessen
the chance of complete failure.
In this case, though, the bottom line is that thorough testing and investigation
would have brought the problem to light. In safety-critical systems, nothing less
should be acceptable.
Source: IEEE Spectrum, November 2004