generate tickets for problems that are not so urgent as to require immediate attention. See
Section 14.1.7.
Dashboard systems generally include a template language that generates HTML pages
and a language for describing data graphs. The data graph descriptions are encoded in
URLs so they may be included as embedded images in the HTML pages. For example, one
URL might specify a graph that compares the ratio of two metrics for the last month for
a particular service. Another might specify a histogram of latency for the 10 slowest web servers,
after calculating latency for hundreds of web servers.
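
As a rough illustration of encoding a graph description in a URL, the sketch below builds query strings for the two graphs just described. The endpoint `dashboard.example.com/graph` and the parameter names (`expr`, `service`, `span`, `kind`, `limit`) are hypothetical and do not correspond to any particular dashboard product.

```python
from html import escape
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical dashboard endpoint, used only for illustration.
GRAPH_BASE = "https://dashboard.example.com/graph"


def graph_url(expr, service, span="30d", kind="line", limit=None):
    """Encode a graph description as a URL (parameter names are assumptions)."""
    params = {"expr": expr, "service": service, "span": span, "kind": kind}
    if limit is not None:
        params["limit"] = str(limit)
    return GRAPH_BASE + "?" + urlencode(params)


def parse_graph_url(url):
    """Decode the graph description back out of the URL's query string."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


# Ratio of two metrics for the last month, for one service.
ratio = graph_url("errors / requests", "checkout")

# Histogram of latency for the 10 slowest web servers.
slowest = graph_url("latency", "web", span="1h", kind="histogram", limit=10)

# The URL can be embedded as an image; escape it for use in an HTML attribute.
img_tag = '<img src="{}" alt="error ratio">'.format(escape(ratio, quote=True))
```

Because the whole description lives in the URL, a template only needs to emit an image tag; the graphing service does the rendering.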
Long-term analysis generally examines data collected over large spans of time, often
the entire history of a metric, to produce trend data. In many cases, this involves generating
and storing summaries of data (averages, aggregates, and so on) so that navigating the data
can be done quickly, although at low resolution. Because this type of analysis requires a
large amount of processing, the results are usually stored permanently rather than regenerated
as needed. Some systems also handle situations where old data is stored on different
media—for example, tape.
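
A minimal sketch of the summarization step, assuming samples arrive as `(timestamp, value)` pairs with timestamps in seconds. The function name `rollup` and the choice of summary fields are illustrative; real systems typically persist these summaries rather than holding them in memory.

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, bucket_seconds):
    """Summarize (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {"min": min(vs), "max": max(vs), "avg": mean(vs), "count": len(vs)}
        for start, vs in sorted(buckets.items())
    }


# Five raw samples collapse into two hourly summaries.
samples = [(0, 10.0), (60, 20.0), (120, 30.0), (3600, 50.0), (3660, 70.0)]
hourly = rollup(samples, 3600)
```

Keeping the count alongside the average lets coarser summaries (daily, monthly) be computed later from the hourly ones without rereading the raw data.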
Anomaly detection is the determination that a specific measurement is not within
expectations. For example, one might examine all web servers of the same type and detect if
one is generating metrics that are significantly different from the others. This could imply
that the one server is having difficulties that others are not. Anomaly detection finds problems
that you didn't think to monitor for.
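
One way to compare a server against its peers is a modified z-score based on the median and the median absolute deviation (MAD). This is a sketch of one possible technique, not the method any particular monitoring system uses; the threshold of 3.5 and the server names are assumptions.

```python
from statistics import median


def find_outliers(metric_by_server, threshold=3.5):
    """Return servers whose metric is far from the peer median.

    Uses the modified z-score 0.6745 * |x - median| / MAD. The threshold
    is a commonly used rule of thumb, chosen here as an assumption.
    """
    values = list(metric_by_server.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread among peers to measure against; report nothing.
        return []
    return [
        name
        for name, v in metric_by_server.items()
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Latency in milliseconds for five servers of the same type (made-up data).
latency_ms = {"web1": 102, "web2": 98, "web3": 101, "web4": 99, "web5": 240}
outliers = find_outliers(latency_ms)
```

The median is used instead of the mean so that the misbehaving server does not drag the baseline toward itself.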
Anomaly detection can also be predictive. Mathematical models can be created that use
last year's data to predict what should be happening this year. One can then detect when
this year's data deviates significantly from the prediction. For example, if you can predict
how many QPS are expected from each country, identifying a deviation of more than 10
percent from the prediction might be a good way to detect regional outages or just that an
entire South American country stops to watch a particular sporting event.
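
The 10 percent rule can be sketched as a comparison between predicted and observed QPS per country. The prediction model itself is out of scope here; the dictionaries, country codes, and numbers below are made-up inputs for illustration.

```python
def deviations(predicted, observed, tolerance=0.10):
    """Return countries whose observed QPS deviates from the prediction
    by more than `tolerance`, mapped to the relative change."""
    flagged = {}
    for country, expected in predicted.items():
        if expected <= 0:
            continue  # A relative deviation is undefined without a baseline.
        actual = observed.get(country, 0.0)
        change = (actual - expected) / expected
        if abs(change) > tolerance:
            flagged[country] = change
    return flagged


# Hypothetical predicted and observed QPS for three countries.
predicted = {"BR": 1000.0, "AR": 400.0, "CL": 200.0}
observed = {"BR": 540.0, "AR": 410.0, "CL": 215.0}
flagged = deviations(predicted, observed)
```

A flagged country is only a signal that traffic differs from the model; deciding whether it is an outage or a sporting event still requires a person or further checks.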
Doing anomaly detection in real time and across many systems can be computationally
difficult, but systems for doing this are becoming more commonplace.
17.4 Alerting and Escalation Manager
The alerting and escalation component manages the process of communicating to oncall
and other people when exceptional situations are detected. If the person cannot be reached
in a certain amount of time, this system attempts to contact others. Section 14.1.7 discusses
alerting strategy and various communication technologies.
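
The escalation behavior described above can be sketched as a loop over an ordered contact list. The `notify` and `wait_for_ack` callables are hypothetical stand-ins for a real paging integration, and the five-minute timeout is an assumption.

```python
def escalate(chain, notify, wait_for_ack, timeout_seconds=300):
    """Contact each person in order until one acknowledges the alert.

    `notify(person)` sends the alert; `wait_for_ack(person, timeout)` returns
    True if that person acknowledged in time. Both are supplied by the caller.
    Returns the person who acknowledged, or None if nobody did.
    """
    for person in chain:
        notify(person)
        if wait_for_ack(person, timeout_seconds):
            return person
    return None


# Stand-in functions that simulate the primary oncall not responding.
contacted = []


def fake_notify(person):
    contacted.append(person)


def fake_wait(person, timeout_seconds):
    return person == "secondary"


acknowledged_by = escalate(["primary", "secondary", "manager"], fake_notify, fake_wait)
```

In this simulated run the primary is paged first, does not acknowledge, and the alert moves on to the secondary, who does.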
The first job of the alerting component is to get the attention of the person oncall, or his
or her substitute. The next job is to communicate specific information. The former is usually
done by pager or text message. Since these systems permit only short messages to be