We will illustrate our points using examples taken from a range of incidents and
accidents, large and small. In doing so, we hope to show how errors do not just
arise because of any inherent error-proneness or maliciousness on the part of users.
Instead, errors are usually the result of an interaction of several contributing
factors (personal, technological, and contextual). Once we accept this state of affairs
we can begin to move away from the need to find someone to blame, and start to
learn from erroneous performance as a way of improving future system performance.
10.1.1 What is Error?
Errors are generally regarded as precursors to accidents. The error triggers a set of
events—often referred to as a chain or sequence, although it is not always a linear
set of events—ultimately leading to an outcome that has serious consequences
involving significant loss of life, money, or machinery. Causal analyses of accidents
usually highlight the fact that there were many contributory factors. There
are obviously exceptions, where a single catastrophic failure leads directly to an
accident, but generally accidents involve a series of several individually minor
events. This process is sometimes described as a domino effect, or represented by
Reason's (1990) Swiss cheese model, in which there are holes in the various
layers of the system, and an accident only occurs when the holes line up across all
the layers.
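To make the alignment idea concrete, the following short Python sketch (an illustration added here, not part of Reason's account) represents each defensive layer as a set of weaknesses, or "holes", and reports a possible accident trajectory only when the same hole appears in every layer. The layer descriptions and hole names are invented for the example.

    # Illustrative sketch of the Swiss cheese idea: an accident trajectory exists
    # only when the same hole (weakness) is present in every defensive layer.
    # The layers and hole names below are hypothetical examples.

    def aligned_holes(layers):
        """Return the holes that line up across all defensive layers."""
        if not layers:
            return set()
        aligned = set(layers[0])
        for holes in layers[1:]:
            aligned &= set(holes)  # keep only holes present in every layer so far
        return aligned

    defences = [
        {"fatigue", "ambiguous display"},            # front-line operators
        {"ambiguous display", "missed inspection"},  # supervision and procedures
        {"ambiguous display"},                       # engineered safeguards
    ]

    print(aligned_holes(defences))  # {'ambiguous display'}: the holes line up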
A similar idea is encapsulated in Randell's (2000) fault-error-failure model that
comes from the field of dependability. A failure is defined as something that occurs
when the service that is delivered is judged to have deviated from its specification.
An error is taken to be the part of the system state that may lead to a subsequent
failure, and the adjudged cause of the error is defined as a fault.
It is very important to note that identifying whether something is a fault, error,
or failure involves making judgments. The fault-error-failure triples can link up so
that you effectively end up with a chain of triples. This is possible because a failure
at one level in the system may constitute a fault at another level. This does not
mean that errors inevitably lead to failures, however. The link between an error
and a failure can be broken either by chance or by taking appropriate design steps
to contain the errors and their effects.
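One way to picture this chaining, as a rough sketch rather than Randell's own formalism, is to record each triple as a small data structure and to treat the failure judged at one level as the fault at the level above. The levels, names, and values in this Python sketch are hypothetical.

    # Minimal sketch (not from Randell 2000) of fault-error-failure triples
    # chaining across system levels: a failure at one level is adjudged to be
    # a fault at the next level up. Levels and values here are hypothetical.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Triple:
        level: str
        fault: str                      # adjudged cause of the error
        error: str                      # system state that may lead to a failure
        failure: Optional[str] = None   # deviation from the specified service,
                                        # or None if the error was contained

    def escalate(lower: Triple, upper_level: str, error: str,
                 failure: Optional[str] = None) -> Triple:
        """Treat the lower level's failure as a fault at the level above."""
        if lower.failure is None:
            raise ValueError("the error was contained, so nothing propagates")
        return Triple(level=upper_level, fault=lower.failure,
                      error=error, failure=failure)

    disk = Triple(level="storage", fault="worn bearing",
                  error="corrupted sector",
                  failure="read request returns wrong data")
    database = escalate(disk, "database", error="stale index entry",
                        failure="query returns an incorrect result")
    print(database.fault)  # the storage-level failure is the database-level fault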
Those errors that have immediate (or near-immediate) effects on system performance
are sometimes called active errors (Reason 1990). This is to distinguish
them from latent errors, which can lie dormant within a system for some
considerable time without having any adverse effect on system performance. The
commission that investigated the nuclear accident at Three Mile Island, for
example, found that an error that had occurred during maintenance (and hence was
latent in the system) led to the emergency feed water system being unavailable
(Kemeny (chairman) 1979). Similarly, the vulnerability of the O-ring seals on the
Challenger Space Shuttle was known about beforehand and hence latent in the system.