2.1.16 Exception Collection
When software generates an exception, it should be collected centrally for analysis. A
software exception is an error so severe that the program intentionally exits. For example,
the software author may decide that a particular situation is unlikely to happen and would
be difficult to recover from; therefore the program declares an exception and exits in
this situation. Certain data corruption scenarios are better handled by a human than by the
software itself. If you've ever seen an operating system “panic” or present a “blue screen
of death,” that is an exception.
When designing software for operability, it is common to use a software framework that
detects exceptions, gathers the error message and other information, and submits it to a
centralized database. Such a framework is referred to as an exception collector.
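As a minimal sketch of the idea, the Python code below installs a hook that runs when an uncaught exception is about to end the program, gathers the error message and some context, and hands the record to a submit function. The names build_report, install_collector, and submit_to_stderr, as well as the "release" and "host" fields, are assumptions made for illustration; a real collector would submit to its centralized database rather than print.

```python
import json
import platform
import sys
import time
import traceback


def build_report(exc_type, exc_value, exc_tb, release="unknown"):
    """Gather the error message and other context into one record."""
    return {
        "type": exc_type.__name__,
        "message": str(exc_value),
        "traceback": "".join(
            traceback.format_exception(exc_type, exc_value, exc_tb)
        ),
        "host": platform.node(),
        "release": release,  # hypothetical field: which build raised the error
        "timestamp": time.time(),
    }


def install_collector(submit, release="unknown"):
    """Route uncaught exceptions to `submit` before the program exits."""

    def hook(exc_type, exc_value, exc_tb):
        try:
            submit(build_report(exc_type, exc_value, exc_tb, release))
        except Exception:
            pass  # a collector failure must not hide the original error
        # Fall through to the default handler so the error is still printed.
        sys.__excepthook__(exc_type, exc_value, exc_tb)

    sys.excepthook = hook


def submit_to_stderr(report):
    """Stand-in for a call that writes to a centralized database."""
    print(json.dumps(report), file=sys.stderr)
```

Calling install_collector(submit_to_stderr, release="1.2.3") early in program start-up would be enough to have every uncaught exception reported before the process exits.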
Exception collection systems offer three benefits. First, since most software systems
have some kind of automatic restart capability, certain exceptions may go unnoticed. If you
never see that the exceptions are occurring, of course, you can't deal with the underlying
causes. An exception collector, however, makes the invisible visible.
Second, exception collection helps determine the health of a system. If there are many
exceptions, maintenance such as rolling out new software releases should be cancelled. If
a sharp increase in exceptions is seen during a roll-out, it may be an indication that the
release is bad and the roll-out should stop.
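The roll-out check described above could be automated along the following lines. This is only a sketch: the function name should_halt_rollout and the thresholds max_ratio and min_rate are illustrative assumptions, and suitable values depend on the service.

```python
def should_halt_rollout(baseline_rate, current_rate, max_ratio=2.0, min_rate=1.0):
    """Return True when the exception rate has risen sharply.

    Rates are exceptions per minute. `max_ratio` and `min_rate` are
    illustrative thresholds, not recommended values.
    """
    if current_rate < min_rate:
        return False  # too few exceptions to draw a conclusion
    if baseline_rate <= 0:
        return True  # exceptions appeared where there were none before
    return current_rate / baseline_rate >= max_ratio
```

The min_rate floor keeps a jump from, say, 0.1 to 0.3 exceptions per minute from stopping a release, since a ratio alone exaggerates changes in very small numbers.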
The third benefit from using an exception collector is that the history of exceptions can
be studied for trends. A simple trend to study is whether the sheer volume of exceptions is
going up or down. Usually exception levels can be correlated to a particular software
release. The other trend to look for is repetition. If a particular type of exception is recorded,
the fact that it is happening more or less frequently is telling. If it occurs less frequently,
that means the software quality is improving. If it is increasing in frequency, then there is
the opportunity to detect it and fix the root cause before it becomes a bigger problem.
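Both trends can be read from the collected history with very little code. The sketch below assumes each stored record is a dictionary with hypothetical "type" and "release" keys; the function names are likewise illustrative.

```python
from collections import Counter


def counts_by_release(records):
    """Total number of exceptions recorded for each release."""
    return Counter(r["release"] for r in records)


def type_trend(records, exc_type, releases):
    """Count one exception type across an ordered list of releases."""
    counts = Counter(r["release"] for r in records if r["type"] == exc_type)
    return [counts[rel] for rel in releases]


def direction(series):
    """Compare the last count to the first to label the trend."""
    if len(series) < 2 or series[-1] == series[0]:
        return "flat"
    return "rising" if series[-1] > series[0] else "falling"
```

Comparing only the first and last counts is a deliberately crude measure; it is enough to flag an exception type worth a closer look, not to replace a proper analysis.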
2.1.17 Documentation for Operations
Developers and operational staff should work together to create a playbook of operating
procedures for the service. A playbook augments the developer-written documentation by
adding operations steps that are informed by the larger business view. For example, the
developers might write the precise steps required to fail over a system to a hot spare. The
playbook would document when such a failover is to be done, who should be notified,
which additional checks must be done before and after failover, and so on. It is critical that
every procedure include a test suite that verifies success or failure. Following is an example
database failover procedure: