Monitoring Fundamentals - The Practice of Cloud System Administration

able to read the data collected by the previous system, and a conversion process is unlikely

to be available. In such a case, if you build a new monitoring system every five or six years,

that may be an upper bound. Time-series databases are becoming more standardized and

easier to convert, however, making this upper bound likely to disappear.

Having the ability to retain decades of monitoring data at full resolution has benefits we

are just beginning to understand. For example, Google's paper “Failure Trends in a Large

Disk Drive Population” (Pinheiro, Weber & Barroso 2007) was able to bust many myths

about hard disk reliability because the authors had access to high-resolution monitoring

data from hundreds of thousands of hard drives' self-monitoring facility (SMART) collected

over five years. Of course, not everyone has seemingly infinite data storage facilities.

Some kind of consolidation or compaction is needed.

The easiest consolidation is to simply delete data that is no longer needed. While originally

many metrics might be collected, many of them will turn out to be irrelevant or

unnecessary. It is better to collect too much when setting up the system than to wish you

had data that you didn't collect. After you run the service for a while, certain metrics may

be deemed unnecessary or may be useful only in the short term. For example, there may

be specific CPU-related metrics that are useful when debugging current issues but whose

utility expires after a year.
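
The deletion approach described above can be sketched as a per-metric retention policy. This is a minimal illustration, not a feature of any particular monitoring product: the metric prefixes, the retention periods, and the function names `retention_for` and `prune` are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical retention rules: metric-name prefix -> how long to keep samples.
# None means "keep forever". Names and durations are illustrative assumptions.
RETENTION = {
    "cpu.debug.": timedelta(days=365),  # short-term debugging metrics
    "disk.smart.": None,                # long-term reliability data
}
DEFAULT_RETENTION = timedelta(days=730)


def retention_for(metric):
    """Return the retention period for a metric, using the longest matching prefix."""
    matches = [prefix for prefix in RETENTION if metric.startswith(prefix)]
    if not matches:
        return DEFAULT_RETENTION
    return RETENTION[max(matches, key=len)]


def prune(samples, now):
    """Keep only samples still within their metric's retention period.

    samples: iterable of (metric_name, timestamp, value) tuples.
    """
    kept = []
    for metric, ts, value in samples:
        keep_for = retention_for(metric)
        if keep_for is None or now - ts <= keep_for:
            kept.append((metric, ts, value))
    return kept
```

Keeping the rules in one table makes it easy to revisit them once experience shows which metrics are no longer needed.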

Another way to reduce storage needs is through summarization, or down-sampling.

With this technique, recent data is kept at full fidelity but older data is replaced by averages

or another form of summarization. For example, metrics might be collected at 1- or 5-minute

intervals. When data is more than 13 months old, hourly averages, percentiles, maximums,

and minimums are calculated and the raw data is deleted. When the data is even older, perhaps

25-37 months, 4-hour or even daily summaries are calculated, reducing the storage

requirements even more. Again, the amount of summarization one can do depends on business

needs. If you need to know only the approximate bandwidth utilization, daily values

may be sufficient.
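
As a rough sketch of the down-sampling described above, the following Python groups raw samples into hourly buckets and records the average, minimum, maximum, and a percentile for each. The function names, the `(timestamp, value)` sample format, and the choice of the 95th percentile with the nearest-rank method are assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def downsample(samples, bucket_seconds=3600):
    """Summarize (unix_timestamp, value) samples into fixed-width time buckets.

    Returns a dict mapping each bucket's start time to its summary statistics.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)

    summaries = {}
    for start, values in sorted(buckets.items()):
        values.sort()
        summaries[start] = {
            "avg": mean(values),
            "min": values[0],
            "max": values[-1],
            "p95": percentile(values, 95),
            "count": len(values),
        }
    return summaries
```

Storing the sample count alongside each average lets hourly summaries be combined later into 4-hour or daily averages by weighting. Percentiles cannot be recomputed exactly from summaries, so any coarser percentile would have to be calculated before the raw data is deleted.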

16.5 Meta-monitoring

Monitoring the monitoring system is called meta-monitoring. How do you know whether you

haven't been alerted today because everything is fine or because the monitoring

system has failed? Meta-monitoring detects situations where the monitoring system

itself is the problem.
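
One common way to tell "everything is fine" apart from "the monitoring system has failed" is a heartbeat: the monitoring system regularly updates a timestamp, and an independent checker raises an alarm when that timestamp becomes stale. The sketch below assumes the heartbeat is a file whose modification time the monitoring system refreshes; the function names and the 300-second threshold are hypothetical.

```python
import os
import time


def heartbeat_is_fresh(last_heartbeat, now, max_age_seconds=300):
    """Return True if the last heartbeat is no older than max_age_seconds."""
    return now - last_heartbeat <= max_age_seconds


def check_heartbeat_file(path, max_age_seconds=300):
    """Classify the monitoring system's heartbeat file as 'ok', 'stale', or 'missing'.

    Assumes the monitoring system updates the file's modification time
    on every successful collection cycle.
    """
    try:
        last = os.path.getmtime(path)
    except OSError:
        return "missing"
    if heartbeat_is_fresh(last, time.time(), max_age_seconds):
        return "ok"
    return "stale"
```

For the check to be useful, it should run somewhere that does not fail together with the monitoring system itself, such as a scheduled job on a separate machine.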

The monitoring system needs to be more available and scalable than the services being

monitored. Every monitoring system should have some kind of meta-monitoring. Even the

smallest system needs a simple check to make sure it is still running. Larger systems should

be monitored for the same scale and capacity issues as any other service to prevent disk
