Monitoring Architecture and Practice - The Practice of Cloud System Administration


net connection has died, the individual could simply do nothing and in a few minutes the

escalation will happen automatically. However, if the escalation schedule involves paging

the oncall person every 5 minutes and not escalating until three attempts have been made,

this means delaying action for 15 minutes. In this case, the person can NAK and the

escalation will happen immediately.
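
The arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `EscalationPolicy` and `minutes_until_escalation` are hypothetical and do not come from any particular alerting product, and the defaults simply mirror the 5-minute, three-attempt example in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EscalationPolicy:
    # Hypothetical policy object; defaults mirror the example in the text.
    page_interval_min: float = 5   # minutes between pages to the same person
    max_attempts: int = 3          # pages sent before escalating


def minutes_until_escalation(policy: EscalationPolicy,
                             nak_at_min: Optional[float] = None) -> float:
    """Return how many minutes after the first page the alert escalates.

    Without a NAK, escalation waits until every attempt has gone unanswered.
    With a NAK, escalation happens as soon as the NAK is received.
    """
    timeout = policy.page_interval_min * policy.max_attempts
    if nak_at_min is not None and nak_at_min < timeout:
        return nak_at_min
    return timeout


policy = EscalationPolicy()
print(minutes_until_escalation(policy))                # no reply at all
print(minutes_until_escalation(policy, nak_at_min=1))  # NAK after one minute
```

With the default policy, doing nothing delays escalation by the full 15 minutes, while a NAK sent one minute after the first page escalates at that point.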

Inevitably, there are alert floods or “pager storms”: situations where dozens or

hundreds of alerts are sent at the same time. This is usually due to one network outage that

causes many alert rules to trigger. In most cases, there is a mechanism to suppress

dependent alerts automatically, but floods may still happen despite the organization's best efforts.

For this reason, an alert system should have the ability to acknowledge all alerts at the same

time. For example, by replying to the text message with the word “ALL” or “STFU,” all

pending alerts for that particular person are acknowledged as well as any alerts received in

the next 5 minutes. The actual alerts can be seen at the alerting dashboard.
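
One way the acknowledge-all behavior might be implemented is sketched below. The class and method names (`Alert`, `AlertManager`, `handle_reply`, `receive`) are hypothetical, timestamps are plain seconds, and the reply is assumed to arrive as a text string already attributed to a recipient. Alerts are kept in the list even when acknowledged, so that a dashboard could still display them.

```python
from dataclasses import dataclass, field

ACK_ALL_WORDS = {"ALL", "STFU"}   # reply keywords from the text
ACK_ALL_WINDOW_SEC = 5 * 60       # also cover alerts in the next 5 minutes


@dataclass
class Alert:
    name: str
    recipient: str
    acked: bool = False


@dataclass
class AlertManager:
    # Hypothetical in-memory state; a real system would persist this.
    pending: list = field(default_factory=list)
    ack_all_until: dict = field(default_factory=dict)  # recipient -> time

    def handle_reply(self, recipient: str, text: str, now: float) -> int:
        """Process a text reply; return how many alerts were acknowledged."""
        if text.strip().upper() not in ACK_ALL_WORDS:
            return 0
        count = 0
        for alert in self.pending:
            if alert.recipient == recipient and not alert.acked:
                alert.acked = True
                count += 1
        # Alerts arriving inside this window are acknowledged on receipt.
        self.ack_all_until[recipient] = now + ACK_ALL_WINDOW_SEC
        return count

    def receive(self, alert: Alert, now: float) -> bool:
        """Record a new alert; return True if the recipient should be paged."""
        if now <= self.ack_all_until.get(alert.recipient, float("-inf")):
            alert.acked = True
        self.pending.append(alert)  # kept for the dashboard either way
        return not alert.acked
```

The window is tracked per recipient, so one person replying “ALL” does not silence pages destined for anyone else.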

Some alert managers have a two-stage acknowledgment. First the oncall person must

acknowledge receiving the alert. This establishes that the person is working on the issue.

This disables alerts for that particular issue while personnel are working on it. When the

issue is resolved, the oncall person must “resolve” the alert to indicate that the issue is fixed.

The benefit of this system is that it makes it easier to generate metrics about how long it

took to resolve the issue. But what if the person forgets to mark the issue resolved? The

system would need to send out reminder alerts periodically, which defeats the purpose of

having two stages.

For this reason, we feel the two-stage acknowledgment provides little actual value. If

there is a desire to record how long it takes to resolve an issue, have the system detect when

the alert is no longer triggering. It will be more accurate and less annoying than requiring

the operator to manually indicate that the issue is resolved.
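
As a rough illustration of detecting resolution automatically, the sketch below derives incident durations from a time-ordered series of rule evaluations. The function name `incident_durations` and the `(timestamp, is_triggering)` sample format are assumptions made for this example; a real monitoring system would read these state changes from its own storage.

```python
from typing import Iterable, List, Tuple


def incident_durations(samples: Iterable[Tuple[float, bool]]) -> List[float]:
    """Return the duration of each completed incident.

    `samples` is a time-ordered sequence of (timestamp, is_triggering)
    pairs. An incident runs from the first triggering sample to the first
    later sample where the alert is no longer triggering. An incident that
    is still open at the end of the data is not reported.
    """
    durations = []
    started = None
    for ts, triggering in samples:
        if triggering and started is None:
            started = ts                      # incident begins
        elif not triggering and started is not None:
            durations.append(ts - started)    # incident ends
            started = None
    return durations


samples = [(0, False), (60, True), (120, True), (180, False),
           (240, True), (300, False), (360, True)]
print(incident_durations(samples))
```

The measured duration is only as precise as the evaluation interval, but it needs no action from the operator.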

17.4.2 Silence versus Inhibit

Operationally, there is a need to be able to silence an alert at will. For example, during

scheduled maintenance the person doing the maintenance does not need to receive alerts

that a system is down. Systems that depend on that system should not generate alerts

because, in theory, their operations teams have been made aware of the scheduled

maintenance.

The mechanism for handling this is called a silence or sometimes a maintenance. A

silence is specified as a start time, an end time, and a specification of what to silence. When

specifying what to silence, it can be useful to accept wildcards or regular expressions.
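
A silence of this shape could be represented as follows. This is a minimal sketch, not the data model of any specific tool: the `Silence` class, the `should_notify` helper, and the `host:check` alert naming scheme are all hypothetical. The rule evaluation is assumed to happen elsewhere; the silence is consulted only when deciding whether to notify anyone.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass
class Silence:
    start: datetime
    end: datetime
    pattern: str  # regular expression matched against the whole alert name

    def matches(self, alert_name: str, now: datetime) -> bool:
        in_window = self.start <= now < self.end
        return in_window and re.fullmatch(self.pattern, alert_name) is not None


def should_notify(alert_name: str, now: datetime,
                  silences: Iterable[Silence]) -> bool:
    """Return False if any active silence covers this alert."""
    return not any(s.matches(alert_name, now) for s in silences)


# Silence every check on the database hosts during a maintenance window.
maintenance = Silence(
    start=datetime(2024, 1, 1, 2, 0),
    end=datetime(2024, 1, 1, 4, 0),
    pattern=r"db\d+\.example\.com:.*",
)
```

Using `re.fullmatch` rather than `re.search` means a pattern has to describe the entire alert name, which reduces the chance of a short pattern silencing more than intended.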

By implementing silences in the alert and escalation system, the alerts still trigger but

no action is taken. This is an important distinction. The alert is still triggering; we're just
