net connection has died, the individual could simply do nothing and in a few minutes the
escalation will happen automatically. However, if the escalation schedule involves paging
the oncall person every 5 minutes and not escalating until three attempts have been made,
this means delaying action for 15 minutes. In this case, the person can NAK and the
escalation will happen immediately.
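The arithmetic above can be sketched in a few lines. This is a minimal illustration, not any particular alert manager's API: the class name `EscalationPolicy` and its fields are hypothetical, and the defaults simply mirror the 5-minute, three-attempt schedule described in the text.

```python
from dataclasses import dataclass


@dataclass
class EscalationPolicy:
    # Hypothetical fields mirroring the schedule described in the text.
    page_interval_min: int = 5  # minutes between pages to the same person
    attempts: int = 3           # pages sent before escalating

    def minutes_until_escalation(self, nak: bool = False) -> int:
        """Delay before the next person in the schedule is paged."""
        if nak:
            return 0  # an explicit NAK escalates immediately
        # Doing nothing means waiting out every attempt first.
        return self.page_interval_min * self.attempts


policy = EscalationPolicy()
print(policy.minutes_until_escalation())          # 15
print(policy.minutes_until_escalation(nak=True))  # 0
```

The point of the NAK is visible in the two results: staying silent costs the full schedule, while an explicit refusal costs nothing.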
Inevitably, there are alert floods or “pager storms”: situations where dozens or hundreds
of alerts are sent at the same time. This is usually due to one network outage that
causes many alert rules to trigger. In most cases, there is a mechanism to suppress dependent
alerts automatically, but floods may still happen despite the organization's best efforts.
For this reason, an alert system should have the ability to acknowledge all alerts at the same
time. For example, by replying to the text message with the word “ALL” or “STFU,” all
pending alerts for that particular person are acknowledged as well as any alerts received in
the next 5 minutes. The actual alerts can be seen at the alerting dashboard.
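One way to picture the bulk acknowledgment is the sketch below. It is an assumption-laden illustration: the `AckState` class, its method names, and the in-memory dictionaries are all hypothetical, and only the keywords and the 5-minute window come from the text.

```python
from datetime import datetime, timedelta

BULK_WORDS = {"ALL", "STFU"}         # reply keywords from the text
BULK_WINDOW = timedelta(minutes=5)   # also covers alerts in the next 5 minutes


class AckState:
    """Hypothetical per-person bookkeeping for pending alerts."""

    def __init__(self):
        self.pending = {}         # person -> set of unacknowledged alert ids
        self.auto_ack_until = {}  # person -> end of the bulk-ack window

    def deliver(self, person, alert_id, now):
        """Return True if the person is paged, False if auto-acknowledged."""
        until = self.auto_ack_until.get(person)
        if until is not None and now < until:
            return False  # still inside the window: acknowledge silently
        self.pending.setdefault(person, set()).add(alert_id)
        return True

    def reply(self, person, text, now):
        """Handle a text reply; return the set of alerts it acknowledged."""
        if text.strip().upper() in BULK_WORDS:
            acked = self.pending.pop(person, set())
            self.auto_ack_until[person] = now + BULK_WINDOW
            return acked
        return set()
```

Suppressed alerts are not lost in this design; they simply stop paging the person, who can still review them on the dashboard.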
Some alert managers have a two-stage acknowledgment. First the oncall person must
acknowledge receiving the alert. This establishes that the person is working on the issue.
This disables alerts for that particular issue while personnel are working on it. When the
issue is resolved, the oncall person must “resolve” the alert to indicate that the issue is fixed.
The benefit of this system is that it makes it easier to generate metrics about how long it
took to resolve the issue. But what if the person forgets to mark the issue resolved? The
system would need to send out reminder alerts periodically, which defeats the purpose of
having two stages.
For this reason, we feel the two-stage acknowledgment provides little actual value. If
there is a desire to record how long it takes to resolve an issue, have the system detect when
the alert is no longer triggering. It will be more accurate and less annoying than requiring
the operator to manually indicate that the issue is resolved.
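Detecting resolution automatically could look like the following sketch. It assumes the monitoring system can supply a time-ordered series of samples saying whether the alert was triggering; the function name `resolution_times` and the sample format are hypothetical.

```python
from datetime import datetime, timedelta


def resolution_times(samples):
    """Derive time-to-resolve from alert state, with no manual "resolve" step.

    samples: iterable of (timestamp, is_triggering) pairs, sorted by time.
    Returns a list of (started, cleared, duration) tuples, one per episode.
    An episode that is still triggering at the end is not reported.
    """
    episodes = []
    started = None
    for ts, triggering in samples:
        if triggering and started is None:
            started = ts  # alert began triggering
        elif not triggering and started is not None:
            episodes.append((started, ts, ts - started))  # alert cleared
            started = None
    return episodes
```

The measured duration is only as precise as the sampling interval, but it does not depend on anyone remembering to close the issue.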
17.4.2 Silence versus Inhibit
Operationally, there is a need to be able to silence an alert at will. For example, during
scheduled maintenance the person doing the maintenance does not need to receive alerts
that a system is down. Systems that depend on that system should not generate alerts
because, in theory, their operations teams have been made aware of the scheduled
maintenance.
The mechanism for handling this is called a silence or sometimes a maintenance. A
silence is specified as a start time, an end time, and a specification of what to silence. When
specifying what to silence, it can be useful to accept wildcards or regular expressions.
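A silence as described here has three parts, which the sketch below models directly. The `Silence` class and its `covers` method are hypothetical names, and matching the pattern against the whole alert name is an assumption made for the example; a real system may match on labels or other fields.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Silence:
    start: datetime
    end: datetime
    pattern: str  # regular expression matched against the whole alert name

    def covers(self, alert_name: str, now: datetime) -> bool:
        """True if this silence applies to the alert at the given time."""
        in_window = self.start <= now < self.end
        return in_window and re.fullmatch(self.pattern, alert_name) is not None


# Example: silence disk alerts on every database host during a maintenance window.
maintenance = Silence(
    start=datetime(2024, 1, 1, 22, 0),
    end=datetime(2024, 1, 1, 23, 0),
    pattern=r"db\d+\.disk_full",
)
```

Using a regular expression lets one silence cover a whole family of hosts instead of requiring one entry per machine.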
By implementing silences in the alert and escalation system, the alerts still trigger but
no action is taken. This is an important distinction. The alert is still triggering; we're just