14.4 Periodic Review of Alerts
The alert log should be reviewed periodically to spot trends and to allocate resources to
create long-term fixes that ultimately reduce the total number of alerts received. When this
strategy is implemented, alerts become more than just a way to be made aware of problems;
they become one of your primary vehicles for improving system stability.
There should be a systematic approach to reducing the number of alerts; otherwise, entropy is likely
to make alerts more frequent over time until their volume spirals out of control.
Alerts should also be analyzed for trends, because each alert is a potential indicator of a larger
issue. If we aim to improve our system continuously, we must take every opportunity
to seek out clues about where improvements are needed.
It's useful to have a weekly meeting to review alerts and issues and look for trends. At
this meeting, you can identify projects that would fix more common issues as well as take
an overall snapshot of the health of your production environment. Quarterly reviews can be
useful to spot even larger trends and can be folded into quarterly project planning cycles.
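As a rough aid for the weekly review, the log can be bucketed by ISO week and alert name, so the most frequent alerts stand out as candidates for long-term fixes. This is a minimal sketch, assuming each log entry is a hypothetical (date fired, alert name) pair; a real alert system will expose its log in its own format.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_alert_counts(alerts):
    """Count alerts per (ISO year, ISO week), broken down by alert name.

    `alerts` is an iterable of (fired_on: date, alert_name: str) pairs.
    This record shape is a hypothetical stand-in for a real alert log.
    """
    weeks = defaultdict(Counter)
    for fired_on, name in alerts:
        iso_year, iso_week, _ = fired_on.isocalendar()
        weeks[(iso_year, iso_week)][name] += 1
    return dict(weeks)


def top_offenders(weekly, n=3):
    """Rank alert names by total count across all weeks (most frequent first)."""
    total = Counter()
    for counts in weekly.values():
        total.update(counts)
    return total.most_common(n)


log = [
    (date(2024, 1, 1), "disk_full"),
    (date(2024, 1, 2), "disk_full"),
    (date(2024, 1, 9), "disk_full"),
    (date(2024, 1, 10), "high_latency"),
]
print(top_offenders(weekly_alert_counts(log)))  # [('disk_full', 3), ('high_latency', 1)]
```

The same weekly buckets can be summed into quarters for the larger-trend review.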
The alert log should be annotated by the person who received the alert. Most systems
permit alerts to be tagged with keywords. The keywords can then be analyzed for trends.
Some sample keywords are listed here:
cause:network
cause:human
cause:bug
cause:hardware
cause:knownissue
severity:small
severity:medium
severity:large
bug:BUGID
tick:TICKETNUMBER
machine:HOSTNAME
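Because the keywords follow a key:value shape, they are easy to analyze mechanically. The sketch below is one illustrative approach, not a feature of any particular alerting system: it splits each tag on the first colon and counts the values of one key across many alerts. The function names and the convention of storing colon-less tags under an empty key are assumptions.

```python
from collections import Counter


def parse_tags(tags):
    """Split 'key:value' tags into a dict mapping key -> list of values.

    Tags without a colon are kept under the key '' so nothing is silently dropped.
    Only the first colon separates key from value.
    """
    parsed = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if not sep:
            key, value = "", tag
        parsed.setdefault(key, []).append(value)
    return parsed


def count_by_key(annotated_alerts, key):
    """Count the values of one tag key (e.g. 'cause') across many alerts."""
    counts = Counter()
    for tags in annotated_alerts:
        counts.update(parse_tags(tags).get(key, []))
    return counts


alerts = [
    ["cause:hardware", "severity:small", "machine:web3"],
    ["cause:bug", "bug:12345"],
    ["cause:hardware", "severity:large"],
]
print(count_by_key(alerts, "cause"))  # Counter({'hardware': 2, 'bug': 1})
```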
These keywords enable you to annotate the general cause of the alert, its severity, and
related bug and ticket IDs. Multiple tags can be used. For example, a small outage caused
by a hardware failure might be tagged cause:hardware and severity:small. If
the problem was caused by a known bug, the alert might be tagged cause:knownissue
and bug:12345, assuming the known issue has bug ID 12345. The cause:knownissue tag
should be reserved for situations where management has decided not to fix a bug,
or where the fix is in progress and a workaround is in place until the fix is delivered.
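Since a cause:knownissue tag is meant to point at a tracked bug, a simple consistency check can flag known-issue alerts that lack a bug reference before the weekly review. This is a hypothetical helper sketched under the assumption that each alert's annotations are available as a list of tag strings; it is not part of any specific tool.

```python
def missing_bug_ids(annotated_alerts):
    """Return the indexes of alerts tagged cause:knownissue with no bug:ID tag.

    A tag like 'bug:' with an empty ID is treated as missing.
    """
    flagged = []
    for i, tags in enumerate(annotated_alerts):
        has_bug = any(t.startswith("bug:") and len(t) > len("bug:") for t in tags)
        if "cause:knownissue" in tags and not has_bug:
            flagged.append(i)
    return flagged


alerts = [
    ["cause:knownissue", "bug:12345"],
    ["cause:knownissue"],
    ["cause:hardware", "severity:small"],
]
print(missing_bug_ids(alerts))  # [1]
```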