Detecting such failures at their very onset is the key to preventing
cascading failures. Distributed computing addresses this problem with
complex algorithms, grouped under the topics of Byzantine failure and
Byzantine fault tolerance.
Monitoring Systems
Systems are created to deal with any number of things. Sometimes they
deal with extremely dangerous situations, for example, nuclear reactors
and space shuttles. It is very difficult to test these systems for failures
because the failures in either case would have catastrophic impacts. Thus,
these systems must be run through hypothetical failure scenarios and
recovery mechanisms. Essential components in these systems include
monitoring systems that detect and report failures, and emergency
control functions that make intelligent decisions, switching control to
safe zones when faults are detected. In some cases this may even include
human intervention.
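The combination of detection, reporting, and emergency control described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real control system: the class name `SafetyMonitor`, the `check` and `operator_reset` methods, the threshold value, and the "NORMAL"/"SAFE" modes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SafetyMonitor:
    """Hypothetical monitor: detects a fault, reports it, and switches to a safe mode."""
    max_safe_value: float                      # assumed limit, chosen for illustration
    mode: str = "NORMAL"
    reports: list = field(default_factory=list)

    def check(self, reading: float) -> str:
        # Detection: a reading above the limit is treated as a fault.
        if reading > self.max_safe_value:
            # Reporting: record the fault so that operators can review it.
            self.reports.append(
                f"fault: reading {reading} exceeds {self.max_safe_value}"
            )
            # Emergency control: hand control to the safe mode and stay there.
            self.mode = "SAFE"
        return self.mode

    def operator_reset(self) -> None:
        # Human intervention: only an explicit reset returns to normal operation.
        self.mode = "NORMAL"


monitor = SafetyMonitor(max_safe_value=100.0)
for reading in (42.0, 97.5, 120.3, 60.0):
    monitor.check(reading)
```

In this sketch the safe mode is latched: the later in-range reading does not return the system to normal operation, which mirrors the idea that recovery may require a person to intervene.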
Software systems should learn from this. Routine checks of the system
should be mandatory. Browsing system logs periodically, even when users
have reported no critical or serious failures, is a good exercise. It is also
helpful to build monitoring software into all server components so that
the health of each component is checked automatically and periodically.
It is important to remember that detecting failures on a few server
components can prevent those failures from spreading to the entire system.
Some techniques used in networking include checksums, parity bits,
software interlocks, watchdog timers, and sample calculations. Sample
calculations are beneficial when writing code for a critical function,
whether or not it involves mathematical operations or multiprocessor
systems. The technique involves performing the same calculation twice, at
different points in time on the same processor, or building software
redundancy by writing multiple versions of the same algorithm that are
executed simultaneously and checked for identical results.
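A sample calculation of this kind might look like the following minimal Python sketch. The function names (`mean_v1`, `mean_v2`, `checked_mean`), the `ResultMismatch` exception, and the tolerance are assumptions made for illustration. For simplicity the versions run one after another rather than simultaneously, and the two versions are compared with a small tolerance because different floating-point algorithms can differ in the last bits.

```python
import math


class ResultMismatch(Exception):
    """Raised when redundant calculations disagree (hypothetical name)."""


def mean_v1(values):
    # First version: built-in sum divided by the number of values.
    return sum(values) / len(values)


def mean_v2(values):
    # Second, independently written version: incremental (running) mean.
    mean = 0.0
    for i, v in enumerate(values, start=1):
        mean += (v - mean) / i
    return mean


def checked_mean(values, rel_tol=1e-9):
    # Same calculation performed twice with the same code ...
    first = mean_v1(values)
    repeat = mean_v1(values)
    # ... and once more with the independent version.
    other = mean_v2(values)
    # The repeated run must match exactly; the independent version must
    # agree within the assumed tolerance.
    if first != repeat or not math.isclose(first, other, rel_tol=rel_tol):
        raise ResultMismatch(f"results disagree: {first}, {repeat}, {other}")
    return first


result = checked_mean([1.0, 2.0, 3.0, 4.0])
```

A mismatch raises an exception instead of returning a value, so the caller cannot silently continue with a suspect result.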
Reliability in Software
Dimitri Kececioglu gives a formal definition of reliability engineering:
“Reliability engineering provides the theoretical and practical
tools whereby the probability and capability of parts,
components, equipment, products and systems to perform their
required functions for desired periods of time without failure,