Designing for Operations - The Practice of Cloud System Administration

2.1.16 Exception Collection

When software generates an exception, it should be collected centrally for analysis. A

software exception is an error so severe that the program intentionally exits. For example, the

software author may decide that a particular situation is unlikely to happen and

will be difficult to recover from; therefore the program declares an exception and exits in

this situation. Certain data corruption scenarios are better handled by a human than by the

software itself. If you've ever seen an operating system “panic” or present a “blue screen

of death,” that is an exception.

When designing software for operability, it is common to use a software framework that

detects exceptions, gathers the error message and other information, and submits it to a

centralized database. Such a framework is referred to as an exception collector.
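
As a rough illustration of the idea, the following Python sketch hooks uncaught exceptions, gathers the error message and some context, and hands the record to a collector. The names `build_record`, `submit`, `COLLECTED`, and the `release` field are hypothetical, and the in-memory list merely stands in for the centralized database; a real framework would send the record over the network.

```python
import sys
import time
import traceback

# Hypothetical in-memory stand-in for the centralized database; a real
# collector would submit each record to a remote service instead.
COLLECTED = []


def build_record(exc_type, exc_value, exc_tb, release="unknown"):
    """Gather the error message and other context into one record."""
    return {
        "type": exc_type.__name__,
        "message": str(exc_value),
        "traceback": "".join(
            traceback.format_exception(exc_type, exc_value, exc_tb)
        ),
        "release": release,  # assumed field, useful for later correlation
        "timestamp": time.time(),
    }


def submit(record):
    """Placeholder for sending the record to the central store."""
    COLLECTED.append(record)


def collecting_excepthook(exc_type, exc_value, exc_tb):
    """Record an uncaught exception, then defer to the default handler."""
    submit(build_record(exc_type, exc_value, exc_tb))
    sys.__excepthook__(exc_type, exc_value, exc_tb)


def install():
    """Route uncaught exceptions through the collector."""
    sys.excepthook = collecting_excepthook
```

Calling `install()` early in program start-up would mean that any exception that terminates the program is recorded before the process exits, even if an automatic restart hides the crash from operators.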

Exception collection systems offer three benefits. First, since most software systems

have some kind of automatic restart capability, certain exceptions may go unnoticed. If you

never see that the exceptions are occurring, of course, you can't deal with the underlying

causes. An exception collector, however, makes the invisible visible.

Second, exception collection helps determine the health of a system. If there are many

exceptions, maintenance such as rolling out new software releases should be cancelled. If

a sharp increase in exceptions is seen during a roll-out, it may be an indication that the

release is bad and the roll-out should stop.
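
The health check described above could be sketched as a simple gate that compares the exception rate before and during a roll-out. The function name `rollout_decision` and the thresholds `max_rate` and `spike_factor` are assumptions chosen for illustration; suitable values depend on the service.

```python
def rollout_decision(baseline_rate, current_rate, max_rate=5.0, spike_factor=2.0):
    """Return "proceed" or "halt" from exceptions-per-minute figures.

    baseline_rate: rate observed before the roll-out began
    current_rate:  rate observed now
    max_rate:      assumed ceiling above which maintenance should not run
    spike_factor:  assumed multiple of the baseline that counts as a sharp increase
    """
    if current_rate > max_rate:
        return "halt"  # too many exceptions overall
    if baseline_rate > 0 and current_rate >= baseline_rate * spike_factor:
        return "halt"  # sharp increase relative to the baseline
    return "proceed"
```

For instance, with these default thresholds a jump from 1.0 to 3.0 exceptions per minute would halt the roll-out, while a drift from 1.0 to 1.2 would not.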

The third benefit from using an exception collector is that the history of exceptions can

be studied for trends. A simple trend to study is whether the sheer volume of exceptions is

going up or down. Usually exception levels can be correlated to a particular software

release. The other trend to look for is repetition. If a particular type of exception is recorded,

the fact that it is happening more or less frequently is telling. If it occurs less frequently,

that suggests the software quality is improving. If it is increasing in frequency, then there is

the opportunity to detect it and fix the root cause before it becomes a bigger problem.
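
Both trends can be computed from the collected history with very little code. The sketch below assumes each stored record is a dictionary with `release` and `type` fields; that record shape and the function names are hypothetical, not a format the text prescribes.

```python
from collections import Counter


def volume_by_release(records):
    """Count all exceptions per software release."""
    return Counter(r["release"] for r in records)


def frequency_trend(records, exc_type, releases):
    """Count one exception type across an ordered list of releases."""
    counts = Counter(r["release"] for r in records if r["type"] == exc_type)
    return [counts[rel] for rel in releases]


def direction(series):
    """Compare the first and last values of a series of counts."""
    if len(series) < 2 or series[-1] == series[0]:
        return "flat"
    return "increasing" if series[-1] > series[0] else "decreasing"
```

Comparing only the first and last counts is a deliberately crude measure; it is enough to flag a type of exception that is becoming more frequent so the root cause can be investigated early.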

2.1.17 Documentation for Operations

Developers and operational staff should work together to create a playbook of operating

procedures for the service. A playbook augments the developer-written documentation by

adding operations steps that are informed by the larger business view. For example, the

developers might write the precise steps required to fail over a system to a hot spare. The

playbook would document when such a failover is to be done, who should be notified,

which additional checks must be done before and after failover, and so on. It is critical that

every procedure include a test suite that verifies success or failure. Following is an example

database failover procedure:
