Field-length problems are endemic in software, and they tend to escape notice
until just before they cause failures.
Problem avoidance: The Y2K problem could have been found by almost any
method, including inspections, static analysis, pair programming, and testing,
except that two-digit dates were considered valid. For more than 30 years,
two-digit dates were not regarded as erroneous, so nobody wrote test cases to
find them.
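A minimal sketch of why two-digit years could pass testing for decades: date
arithmetic on them gives correct answers as long as both dates fall in the same
century. The function names below are hypothetical and are not taken from any
real legacy system.

```python
def years_between_two_digit(start_yy: int, end_yy: int) -> int:
    """Elapsed years from two-digit years, as a short date field would store them."""
    return end_yy - start_yy


def years_between_four_digit(start_year: int, end_year: int) -> int:
    """Elapsed years from full four-digit years."""
    return end_year - start_year


# Both dates inside the 1900s: the two-digit result is correct.
print(years_between_two_digit(65, 99))          # 34

# The end date rolls over to 2000, stored as 00: the result goes negative.
print(years_between_two_digit(65, 0))           # -65

# The same interval with four-digit years.
print(years_between_four_digit(1965, 2000))     # 35
```

A test suite that only used dates before 2000 would never have exercised the
second case.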
Starting in about 1995, this situation changed: not only did testing begin to
look for short dates, but a number of specialized Y2K tools were built to ferret
them out in legacy applications.
Although Y2K itself is now behind us, not allocating enough space for numeric
information remains one of the most common problems in the history of software.
2004: Shutdown of Los Angeles Airport (LAX) Air-Traffic Controls
On Tuesday, September 14, 2004, near 5 P.M., the air-traffic controllers at LAX
lost voice contact with about 400 in-flight planes. Radar screens also stopped
working. A total of about 800 flights were affected and had to be diverted. The
problem was serious: a backup system failed about one minute after being
activated, and the system was out of service for around three-and-a-half hours.
The apparent cause of this problem was an internal counter that counts down
from about four billion and then needs to be reset. The counter was used to send
messages to system components at fixed intervals. It usually takes about fifty
days to reach zero. Normally, the counter was reset after thirty days, but
apparently a scheduled reset was missed by an employee who was not fully
trained. The servers in use were from Microsoft.
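The figures of "about four billion" and "about fifty days" are consistent with
a 32-bit counter that ticks once per millisecond. The text does not state the
counter's width or tick rate, so both are assumptions in the sketch below,
which only works through the arithmetic.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def days_until_zero(start_count: int, ticks_per_second: int = 1000) -> float:
    """Days for a countdown from start_count to reach zero at the given tick rate."""
    return start_count / ticks_per_second / SECONDS_PER_DAY


# Assumption: 32-bit counter, so it starts at 2**32 = 4,294,967,296.
counter_start = 2**32
lifetime_days = days_until_zero(counter_start)

print(round(lifetime_days, 1))        # 49.7

# Margin left if the counter is reset after thirty days, as the procedure required.
print(round(lifetime_days - 30, 1))   # 19.7
```

Under these assumptions, a single missed thirty-day reset leaves a margin of
under three weeks before the counter runs out.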
Lessons learned: The obvious lesson is that complex systems that require human
intervention to keep running will eventually fail. Several kinds of automated resets
could have been designed, or control could have been passed to backup servers
with different reset intervals.
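One of the automated resets mentioned above could be sketched as a counter that
restores itself once it falls to a safety threshold, so that no operator action
is needed. The class, its fields, and the day-based units are all hypothetical;
a real system would also have to choose a safe moment to perform the reset.

```python
class SelfResettingCountdown:
    """Hypothetical countdown that resets itself before it can reach zero."""

    def __init__(self, start: int, reset_threshold: int) -> None:
        if not 0 < reset_threshold < start:
            raise ValueError("reset_threshold must be between 0 and start")
        self.start = start
        self.reset_threshold = reset_threshold
        self.value = start
        self.auto_resets = 0

    def tick(self, elapsed: int = 1) -> None:
        """Advance the countdown and reset automatically at the threshold."""
        self.value -= elapsed
        if self.value <= self.reset_threshold:
            self.value = self.start
            self.auto_resets += 1


# A 50-day countdown that resets itself with 20 days remaining, run for a year.
timer = SelfResettingCountdown(start=50, reset_threshold=20)
for _day in range(365):
    timer.tick()
    assert timer.value > 0  # the countdown never runs out

print(timer.auto_resets)  # 12
```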
Problem avoidance: This was a combination of human error and a questionable
design in the servers that required manual resets. Quality function deployment
(QFD) might have prevented the problem. Design inspections would certainly have
found the problem. Neither pair programming nor static analysis would have
identified it because of the mix of humans and software.