Failure - Building Software: A Practitioner's Guide - page 12

Detecting such failures at their very onset is the key to preventing cascading failures. Distributed computing has complex algorithms for this purpose, studied under the topics of Byzantine failure and Byzantine fault tolerance.

Monitoring Systems

Systems are built to handle a wide range of tasks. Some of them control extremely dangerous operations, for example, nuclear reactors and space shuttles. It is very difficult to test such systems for failures because a real failure in either case would have a catastrophic impact. Instead, these systems must be run through hypothetical failure scenarios and their recovery mechanisms exercised. Essential components in these systems include monitoring systems for the detection and reporting of failures, and emergency control functions that make intelligent decisions by switching control to safe zones when faults are detected. In some cases this may even include human intervention.
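The pairing of a monitoring component with an emergency control function can be sketched in a few lines. This is only an illustration under stated assumptions, not a design taken from any real safety system: the `Monitor` class, the `read_sensor` callable, the `limit` threshold, and the `max_faults` count are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Mode(Enum):
    NORMAL = "normal"
    SAFE = "safe"


@dataclass
class Monitor:
    """Polls a reading, reports faults, and falls back to a safe mode."""
    read_sensor: Callable[[], float]  # hypothetical sensor source
    limit: float                      # hypothetical safety threshold
    max_faults: int = 2               # consecutive faults tolerated before fallback
    mode: Mode = Mode.NORMAL
    faults: int = 0
    report: List[str] = field(default_factory=list)

    def check(self) -> Mode:
        """Run one monitoring cycle and return the resulting mode."""
        if self.mode is Mode.SAFE:
            return self.mode  # stay safe until an operator resets
        try:
            value = self.read_sensor()
            ok = value <= self.limit
            detail = f"reading {value} exceeds limit {self.limit}"
        except Exception as exc:  # an unreadable sensor is itself a fault
            ok = False
            detail = f"sensor read failed: {exc}"
        if ok:
            self.faults = 0
            return self.mode
        self.faults += 1
        self.report.append(detail)  # reporting of the detected failure
        if self.faults >= self.max_faults:
            self.mode = Mode.SAFE   # emergency control: switch to the safe zone
            self.report.append("switched to safe mode")
        return self.mode

    def operator_reset(self) -> None:
        """Human intervention: return to normal operation."""
        self.mode = Mode.NORMAL
        self.faults = 0


# Hypothetical failure scenario: two consecutive out-of-range readings.
readings = iter([10.0, 55.0, 60.0, 12.0])
monitor = Monitor(read_sensor=lambda: next(readings), limit=50.0)
modes = [monitor.check() for _ in range(4)]
```

Feeding the monitor a scripted sequence of readings, as above, is one way to run through a hypothetical failure scenario without provoking a real failure. Once in safe mode, the sketch stays there until `operator_reset` is called, which stands in for the human intervention step.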

Software systems should learn from this. Routine checks of the system should be mandatory. Browsing system logs periodically, even when users have reported no critical or serious failures, is a good exercise. It is also helpful to build monitoring software into every server component so that the health of the component is checked automatically at regular intervals. It is important to remember that detecting failures on a few server components can prevent those failures from spreading to the entire system. Some techniques used in networking include checksums, parity bits, software interlocks, watchdog timers, and sample calculations. Sample calculations are beneficial when writing code for a critical function, whether or not it involves mathematical operations or multiprocessor systems. The technique involves doing the same calculation twice, at different points in time on the same processor, or even building software redundancy by writing multiple versions of the same algorithm that are executed simultaneously and checked for identical results.
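The two forms of sample calculation described above might look roughly like the following. This is a minimal sketch rather than a production pattern: `run_twice`, `run_versions`, `ResultMismatch`, and the three `mean_*` functions are hypothetical names invented for the example, and the versions are run one after another here instead of simultaneously.

```python
import math
from typing import Callable, Sequence


class ResultMismatch(Exception):
    """Raised when redundant calculations disagree."""


def run_twice(func: Callable[..., float], *args: float, tol: float = 1e-9) -> float:
    """Time redundancy: repeat the same calculation and compare the results."""
    first = func(*args)
    second = func(*args)
    if not math.isclose(first, second, rel_tol=tol, abs_tol=tol):
        raise ResultMismatch(f"{first!r} != {second!r}")
    return first


def run_versions(versions: Sequence[Callable[..., float]], *args: float,
                 tol: float = 1e-9) -> float:
    """Software redundancy: run independent versions and require agreement."""
    results = [version(*args) for version in versions]
    reference = results[0]
    for other in results[1:]:
        if not math.isclose(reference, other, rel_tol=tol, abs_tol=tol):
            raise ResultMismatch(f"versions disagree: {results!r}")
    return reference


# Two hypothetical, independently written versions of the same algorithm.
def mean_v1(*values: float) -> float:
    return sum(values) / len(values)


def mean_v2(*values: float) -> float:
    average = 0.0
    for count, value in enumerate(values, start=1):
        average += (value - average) / count  # incremental running mean
    return average


def mean_buggy(*values: float) -> float:
    return sum(values) / (len(values) - 1)  # deliberate off-by-one fault
```

Calling `run_versions([mean_v1, mean_v2], 2.0, 4.0, 6.0)` returns the agreed result, while swapping in `mean_buggy` raises `ResultMismatch`, so the fault is detected before the wrong value spreads further. Results are compared with a small tolerance because independently written floating-point code rarely agrees to the last bit.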

Reliability in Software

Dimitri Kececioglu offers a formal definition of reliability engineering:

“Reliability engineering provides the theoretical and practical tools whereby the probability and capability of parts, components, equipment, products and systems to perform their required functions for desired periods of time without failure,
