Monitoring Architecture and Practice - The Practice of Cloud System Administration

generate tickets for problems that are not so urgent as to require immediate attention. See Section 14.1.7.

Dashboard systems generally include a template language that generates HTML pages and a language for describing data graphs. The data graph descriptions are encoded in URLs so they may be included as embedded images in the HTML pages. For example, one URL might specify a graph that compares the ratio of two metrics for the last month for a particular service. Another might specify a histogram of latency for the 10 slowest web servers, after calculating latency for hundreds of web servers.
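
As a rough illustration of a graph description carried in a URL, the following Python sketch builds such a URL and wraps it in an image tag. The endpoint `dashboard.example.com/graph` and the parameter names (`expr`, `service`, `period`, `kind`) are invented for this example; real dashboard systems define their own.

```python
import html
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
GRAPH_BASE = "https://dashboard.example.com/graph"


def graph_url(expr, service, period, kind="line", width=600, height=200):
    """Encode a graph description as URL query parameters.

    All parameter names here are assumptions, not a real product's API.
    """
    params = {
        "expr": expr,        # what to plot, e.g. a ratio of two metrics
        "service": service,  # which service the metrics belong to
        "period": period,    # how far back to look, e.g. "30d"
        "kind": kind,        # "line", "histogram", ...
        "width": width,
        "height": height,
    }
    return GRAPH_BASE + "?" + urlencode(params)


def img_tag(url, alt):
    """Embed the graph in an HTML page as an image."""
    return '<img src="{}" alt="{}">'.format(
        html.escape(url, quote=True), html.escape(alt, quote=True)
    )


# Ratio of two metrics over the last month for one service.
ratio_url = graph_url("errors / requests", "checkout", "30d")
ratio_img = img_tag(ratio_url, "Error ratio, last 30 days")
```

Because the whole description lives in the URL, a template only needs to emit an image tag; the graphing service does the rendering when the browser fetches the image.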

Long-term analysis generally examines data collected over large spans of time, often the entire history of a metric, to produce trend data. In many cases, this involves generating and storing summaries of data (averages, aggregates, and so on) so that navigating the data can be done quickly, although at low resolution. Because this type of analysis requires a large amount of processing, the results are usually stored permanently rather than regenerated as needed. Some systems also handle situations where old data is stored on different media, such as tape.
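
A minimal sketch of the summarizing step, assuming samples arrive as `(timestamp_in_seconds, value)` pairs. The function name `rollup` and the choice of minimum, mean, maximum, and count as the stored summary are illustrative assumptions, not a prescribed format.

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, bucket_seconds):
    """Summarize (timestamp, value) samples into fixed-width time buckets.

    Returns (bucket_start, min, mean, max, count) tuples sorted by time.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of the bucket it falls in.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, min(vals), mean(vals), max(vals), len(vals))
        for start, vals in sorted(buckets.items())
    ]


# Five raw samples reduced to two hourly summaries.
raw = [(0, 10.0), (60, 20.0), (3599, 30.0), (3600, 40.0), (7100, 60.0)]
hourly = rollup(raw, 3600)
```

Storing the summaries means a year-long graph can be drawn from a few thousand rows instead of millions of raw samples, at the cost of resolution within each bucket.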

Anomaly detection is the determination that a specific measurement is not within expectations. For example, one might examine all web servers of the same type and detect if one is generating metrics that are significantly different from the others. This could imply that the one server is having difficulties that others are not. Anomaly detection finds problems that you didn't think to monitor for.
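
One simple way to compare a server against its peers is sketched below. It flags servers whose metric lies far from the group median, scaled by the median absolute deviation (MAD). This particular statistic and the 3.5 cutoff are one common choice, assumed here for illustration; the text does not prescribe a method, and the server names and values are made up.

```python
from statistics import median


def find_outliers(metric_by_server, threshold=3.5):
    """Return the servers whose metric is far from the group median.

    Uses a modified z-score: 0.6745 * (value - median) / MAD.
    """
    values = list(metric_by_server.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # No spread to scale by; this sketch reports nothing in that case.
        return []
    return [
        name
        for name, value in metric_by_server.items()
        if abs(0.6745 * (value - center) / mad) > threshold
    ]


# Latency in milliseconds for web servers of the same type (invented data).
latency_ms = {"web01": 102, "web02": 98, "web03": 101, "web04": 99, "web05": 250}
outliers = find_outliers(latency_ms)
```

Median-based statistics are used here rather than mean and standard deviation because a single misbehaving server would otherwise distort the very baseline it is being compared against.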

Anomaly detection can also be predictive. Mathematical models can be created that use last year's data to predict what should be happening this year. One can then detect when this year's data deviates significantly from the prediction. For example, if you can predict how many QPS are expected from each country, identifying a deviation of more than 10 percent from the prediction might be a good way to detect regional outages, or simply that an entire South American country has stopped to watch a particular sporting event.
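
The 10 percent check can be sketched as follows. The prediction itself is assumed to come from some model not shown here; the function name `deviations`, the country codes, and the QPS figures are all invented for the example.

```python
def deviations(predicted_qps, actual_qps, tolerance=0.10):
    """Return {country: relative deviation} where the actual QPS differs
    from the prediction by more than the tolerance (10 percent by default).
    """
    flagged = {}
    for country, predicted in predicted_qps.items():
        if predicted <= 0:
            # A relative deviation is undefined without a positive prediction.
            continue
        actual = actual_qps.get(country, 0.0)
        relative = (actual - predicted) / predicted
        if abs(relative) > tolerance:
            flagged[country] = relative
    return flagged


# Invented numbers: traffic from one country drops well below prediction.
predicted = {"BR": 1000.0, "AR": 400.0, "CL": 200.0}
actual = {"BR": 620.0, "AR": 410.0, "CL": 195.0}
flagged = deviations(predicted, actual)
```

The check only says that traffic is unexpected, not why; a drop like the one above looks the same whether the cause is an outage or a football match.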

Doing anomaly detection in real time and across many systems can be computationally difficult, but systems for doing this are becoming more commonplace.

17.4 Alerting and Escalation Manager

The alerting and escalation component manages the process of communicating to oncall and other people when exceptional situations are detected. If the person cannot be reached in a certain amount of time, this system attempts to contact others. Section 14.1.7 discusses alerting strategy and various communication technologies.
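
The escalation behavior described above can be sketched as a loop over an ordered list of contacts. The callables `notify` and `wait_for_ack` are hypothetical stand-ins for whatever paging and acknowledgment mechanisms a real system provides, and the five-minute timeout is an arbitrary assumption.

```python
def escalate(chain, notify, wait_for_ack, ack_timeout_seconds=300):
    """Contact each person in turn until one acknowledges the alert.

    chain: ordered contacts, e.g. the oncall person first, then substitutes.
    notify(contact): sends the alert (pager, text message, ...).
    wait_for_ack(contact, timeout): True if acknowledged within the timeout.
    Returns the contact who acknowledged, or None if nobody did.
    """
    for contact in chain:
        notify(contact)
        if wait_for_ack(contact, ack_timeout_seconds):
            return contact
    return None


# Stubbed demonstration: the primary never answers, the secondary does.
sent = []
responder = escalate(
    ["primary", "secondary", "manager"],
    notify=sent.append,
    wait_for_ack=lambda contact, timeout: contact == "secondary",
)
```

In this run the manager is never paged, because escalation stops as soon as someone acknowledges.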

The first job of the alerting component is to get the attention of the person oncall, or his or her substitute. The next job is to communicate specific information. The former is usually done by pager or text message. Since these systems permit only short messages to be
