3 Addressing Component Failures
In a critical system (or any system in which it is important to limit the
possible damage to the equipment) all assumptions must be systematically
questioned. Potential faults must be identified and the software must deal with
them appropriately.
It is pointed out in Section 1.4 that it is desirable to layer a specification by
separating the behaviour under different sets of assumptions: the most optimistic
(no faults in external components) through to minimal behaviour which might
involve setting off alarms.
One way to undertake such a division is to treat the separate systems as
different problems and to look at their combination with programming
combinators. In the world of “normal design” such decompositions might be
standard and the choice of components be so accepted that one could indeed just
use the techniques presented so far to specify the individual problems.
Computer technology has, however, developed so fast that many problems fall
into the “radical design” category. We should in any case like to be able to
deduce properties of an overall system. The source of the difficulty with which
we have struggled is the continuous-time specifications which our applications
have forced us to employ. It is not difficult to describe normal behaviour as in
Section 2; describing fault-tolerant behaviour uses similar notation plus the ideas
in this section. The key issue is how to describe the handover between the normal
and fault-tolerant phases of operation. Our ideas for this will appear elsewhere
but an indication of the approach is given in Section 4.3.
3.1 Faults in the Sluice Gate System
In our treatment of the sluice gate example so far, we have focused on the
situation where all of the (physical) components operate faultlessly. We now
consider what sorts of issues arise when trying to cope with component failure.
In the sluice gate problem, components like sensors can fail; for example, they
can become stuck false or they can become stuck true. Moreover, the motor
could burn out and no longer be able to move the gate when power is applied to
it. Such component failures are faults in the larger system and a useful control
program will limit their impact even if it cannot meet the original requirements.
In [Jac00] this obligation is called the reliability concern. If a faulty component
is detected, the Control Machine should, perhaps, switch off the motor and turn
on an alarm to indicate that the system needs attention from the maintenance
engineer and that the irrigation requirement is no longer being satisfied.
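The reaction just described can be illustrated with a minimal sketch. It is not the specification developed in this paper: the names Readings, Commands, fault_suspected and control_step, the bound MAX_TRAVEL_SECONDS, and the two detection rules are all assumptions introduced here for illustration. The sketch suspects a fault when both end sensors read true at once (at least one is stuck true), or when the motor has been driven longer than a healthy gate would need without the target sensor becoming true (a sensor stuck false or a burnt-out motor).

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    UP = "up"
    DOWN = "down"


@dataclass
class Readings:
    top: bool     # top sensor reports the gate fully open
    bottom: bool  # bottom sensor reports the gate fully closed


@dataclass
class Commands:
    motor_on: bool
    alarm_on: bool


# Hypothetical bound: the longest a healthy gate needs to travel end to end.
MAX_TRAVEL_SECONDS = 30.0


def target_reached(readings: Readings, direction: Direction) -> bool:
    """True when the sensor at the end the gate is moving towards is set."""
    return readings.top if direction is Direction.UP else readings.bottom


def fault_suspected(readings: Readings, direction: Direction,
                    seconds_driven: float) -> bool:
    """Assumed detection rules; a real system would need a careful analysis."""
    # The gate cannot be at both ends: at least one sensor is stuck true.
    if readings.top and readings.bottom:
        return True
    # Driven too long without arriving: sensor stuck false or motor failed.
    return (seconds_driven > MAX_TRAVEL_SECONDS
            and not target_reached(readings, direction))


def control_step(readings: Readings, direction: Direction,
                 seconds_driven: float) -> Commands:
    """One decision of the (hypothetical) Control Machine."""
    if fault_suspected(readings, direction, seconds_driven):
        # Minimal behaviour: stop the motor and call the maintenance engineer.
        return Commands(motor_on=False, alarm_on=True)
    # Normal behaviour: keep driving until the target sensor is reached.
    return Commands(motor_on=not target_reached(readings, direction),
                    alarm_on=False)
```

Note that the sketch cannot tell a stuck-false sensor from a burnt-out motor; it only decides that the normal requirement can no longer be relied upon and falls back to the minimal behaviour.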
It will become clear that it is more difficult to maintain our isolation from
details of the physical world when we examine fault-tolerance but we will examine
ways in which such considerations can be brought in gradually.
It would be possible to follow the method described above with weaker
assumptions about the physical components (and additional requirements with
respect to alarms) but the resulting specification might become opaque because
it would lack structure. One would like to achieve a structure which preserved