Java Reference
In-Depth Information
The FAA first discovered the problem when it ran tests on the system in the
field. It ran for 49.7 days, and then crashed. After rebooting, everything seemed
fine. When a similar crash happened with another system, the FAA instituted the
30-day reboot procedure.
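The 49.7-day figure is the giveaway: a 32-bit counter incremented once per millisecond wraps around after 2^32 milliseconds. A minimal sketch of that arithmetic (the class and variable names here are illustrative, not taken from the actual Harris system):

```java
// Illustrates why a 32-bit millisecond uptime counter fails
// after roughly 49.7 days: 2^32 ms divided into days.
public class CounterOverflow {
    public static void main(String[] args) {
        long wrapMillis = 1L << 32;  // 2^32 = 4,294,967,296 ms before wraparound
        double days = wrapMillis / (1000.0 * 60 * 60 * 24);
        System.out.printf("32-bit ms counter wraps after %.1f days%n", days);
    }
}
```

Running it prints "32-bit ms counter wraps after 49.7 days" — exactly the uptime at which the system crashed in the field, and comfortably inside the FAA's 30-day reboot window.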
After the incident in Los Angeles, the issue was tracked down, and a software
patch was created to fix the problem. Now the system periodically resets the
counter without the need for human intervention.
Lessons Learned
In this situation, the problem (if not its implications) was known beforehand.
Harris (the manufacturer) knew about the potential for the timer to expire but
hadn't determined the impact it might have on the system. The FAA discovered
the problem during tests---although not the root cause. Instead of investigating further,
the agency instituted a manual workaround---the ultimate “when in doubt,
reboot” scenario.
It's true that the problem would have been avoided if the FAA procedures had
been followed, but that's of little comfort when the software can make such pro-
cedures unnecessary. It's also true that the incident would have been negligible if
the backup system had not failed. Having redundant backup systems would lessen
the chance of complete failure.
In this case, though, the bottom line is that thorough testing and investigation
would have brought the problem to light. In safety-critical systems, nothing less
should be acceptable.
Source: IEEE Spectrum, November 2004