able to read the data collected by the previous system, and a conversion process is unlikely
to be available. In such a case, if you build a new monitoring system every five or six years,
that may be an upper bound. Time-series databases are becoming more standardized and
easier to convert, however, making this upper bound likely to disappear.
Having the ability to retain decades of monitoring data at full resolution has benefits we
are just beginning to understand. For example, Google's paper “Failure Trends in a Large
Disk Drive Population” (Pinheiro, Weber & Barroso 2007) was able to bust many myths
about hard disk reliability because the authors had access to high-resolution monitoring
data from hundreds of thousands of hard drives' self-monitoring facility (SMART) collected
over five years. Of course, not everyone has seemingly infinite data storage facilities.
Some kind of consolidation or compaction is needed.
The easiest consolidation is to simply delete data that is no longer needed. While originally
many metrics might be collected, many of them will turn out to be irrelevant or
unnecessary. It is better to collect too much when setting up the system than to wish you
had data that you didn't collect. After you run the service for a while, certain metrics may
be deemed unnecessary or may be useful only in the short term. For example, there may
be specific CPU-related metrics that are useful when debugging current issues but whose
utility expires after a year.
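As a rough illustration of deleting data whose usefulness has expired, the sketch below applies a per-metric retention period. The metric names, the `cpu.debug.` prefix, and the one-year period are hypothetical choices for this example, not part of any particular monitoring product.

```python
from datetime import datetime, timedelta

# Hypothetical retention rules: metric-name prefix -> maximum age to keep.
# Metrics that match no rule are kept indefinitely.
RETENTION = {
    "cpu.debug.": timedelta(days=365),
}

def is_expired(metric_name, timestamp, now, retention=RETENTION):
    """Return True if a data point is older than its metric's retention period."""
    for prefix, max_age in retention.items():
        if metric_name.startswith(prefix):
            return now - timestamp > max_age
    return False  # no rule matched: keep the data

def purge(points, now, retention=RETENTION):
    """Return the (metric_name, timestamp, value) points still within retention."""
    return [p for p in points if not is_expired(p[0], p[1], now, retention)]
```

Keeping the rules in one table makes it easy to start with no expiry at all and add rules later, once it is clear which metrics have only short-term value.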
Another way to reduce storage needs is through summarization, or down-sampling.
With this technique, recent data is kept at full fidelity but older data is replaced by averages
or other forms of summarization. For example, metrics might be collected at 1- or 5-minute
intervals. When data is more than 13 months old, hourly averages, percentiles, maximums,
and minimums are calculated and the raw data is deleted. When the data is even older, perhaps
25-37 months, 4-hour or even daily summaries are calculated, reducing the storage
requirements even more. Again, the amount of summarization one can do depends on business
needs. If you need to know only the approximate bandwidth utilization, daily values
may be sufficient.
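The hourly summarization step described above can be sketched as follows. This is a minimal example that assumes samples arrive as (Unix timestamp in seconds, value) pairs and that the caller has already selected the samples old enough to summarize. The function name and the choice of the 95th percentile, computed with the nearest-rank method, are assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean

def downsample_hourly(points):
    """Summarize (unix_seconds, value) samples into per-hour statistics."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Group each sample under the start of the hour that contains it.
        buckets[ts - ts % 3600].append(value)

    summaries = {}
    for hour_start, values in sorted(buckets.items()):
        values.sort()
        # Nearest-rank percentile; other percentile definitions are equally valid.
        rank = max(1, math.ceil(0.95 * len(values)))
        summaries[hour_start] = {
            "avg": mean(values),
            "min": values[0],
            "max": values[-1],
            "p95": values[rank - 1],
        }
    return summaries
```

With 1-minute collection, each hour's 60 raw samples are replaced by four summary values, so storage for that metric shrinks by roughly a factor of 15. With 5-minute collection, 12 samples become four values, a factor of 3. The raw data should be deleted only after the summaries have been written successfully.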
16.5 Meta-monitoring
Monitoring the monitoring system is called meta-monitoring. How do you know if the
reason you haven't been alerted today is because everything is fine or because the monitoring
system has failed? Meta-monitoring detects situations where the monitoring system
itself is the problem.
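One simple form of meta-monitoring is a heartbeat check. The sketch below assumes the monitoring system records a timestamp each time it completes a collection cycle, and that a separate process, ideally on a different machine, reads that timestamp and raises an alert when it is too old. The function name and the 300-second threshold are hypothetical.

```python
import time

def monitoring_is_stale(last_heartbeat, now=None, max_age_seconds=300):
    """Return True if the monitoring system's last heartbeat is too old.

    last_heartbeat and now are Unix timestamps in seconds.
    """
    if now is None:
        now = time.time()
    return now - last_heartbeat > max_age_seconds
```

The checker must not depend on the monitoring system it is watching. If both share the same host or alerting path, a single failure can silence them together.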
The monitoring system needs to be more available and scalable than the services being
monitored. Every monitoring system should have some kind of meta-monitoring. Even the
smallest system needs a simple check to make sure it is still running. Larger systems should
be monitored for the same scale and capacity issues as any other service to prevent disk