Monitoring Fundamentals - The Practice of Cloud System Administration

This is important because logs from different systems are often compared when debugging
or figuring out what went wrong after an outage. For example, when writing a postmortem
one builds a timeline of events by collecting logs from various machines and services,
including chat room transcripts, instant message sessions, and email discussions. If
each of these services records timestamps in its local time zone, just figuring out the order
of what happened can be a challenge, especially if such systems do not indicate which time
zone was being used.

Consolidating logs on a large scale is generally automated, and the consolidation process
normalizes all logs to UTC. Unfortunately, configuration mistakes can result in logging
data being normalized incorrectly, something that is not noticed until it is too late. This
problem can be avoided by using UTC for everything.
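
The normalization step described above can be sketched in a few lines of Python. This is only an illustration, assuming every source emits ISO 8601 timestamps that carry an explicit UTC offset; the host names and messages are made up for the example, and to_utc and build_timeline are hypothetical helper names.

```python
from datetime import datetime, timezone


def to_utc(stamp):
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        # A timestamp without an offset cannot be placed on a shared timeline.
        raise ValueError(f"timestamp has no UTC offset: {stamp!r}")
    return parsed.astimezone(timezone.utc)


def build_timeline(events):
    """Order (timestamp, source, message) tuples by the instant they occurred."""
    return sorted((to_utc(ts), source, msg) for ts, source, msg in events)


# Hypothetical events recorded in three different local time zones.
events = [
    ("2014-06-01T09:05:00-07:00", "web01", "error rate rising"),
    ("2014-06-01T18:02:00+02:00", "chat", "anyone else seeing 500s?"),
    ("2014-06-01T16:10:00+00:00", "db01", "replication lag alert"),
]

timeline = build_timeline(events)
for when, source, msg in timeline:
    print(when.isoformat(), source, msg)
```

Sorted by local wall-clock time, the web01 entry would appear first; once converted to UTC, the chat message turns out to be the earliest event. Rejecting timestamps that lack an offset mirrors the point made above: a log that does not say which time zone it used cannot be ordered reliably.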

Google famously timestamps logs using the U.S./Pacific time zone, which caused no
end of frustration for Tom when he worked there. This time zone has a different daylight
saving time calendar than Europe, making log normalization extra complex for a few weeks
each year, with the exact dates depending on the year. It also means that software must be
written to understand that one day each year is missing an hour, and another day each year
has an extra hour. Legend has it that the U.S./Pacific time zone is used simply because the
first Google systems administrator did not consider the ramifications of his decision. The
time zone is embedded so deeply in Google's many systems that there is little hope it will
ever change.
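
The missing and extra hours can be seen directly by measuring how long a calendar day lasts in a zone that observes daylight saving time. The sketch below assumes Python 3.9 or later with the standard zoneinfo module and an installed time zone database; the IANA name America/Los_Angeles stands in for U.S./Pacific, the 2014 dates are chosen for illustration, and day_length_hours is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def day_length_hours(year, month, day, tz=PACIFIC):
    """Return the elapsed hours between local midnight and the next local midnight."""
    start = datetime(year, month, day, tzinfo=tz)
    # Adding a timedelta to an aware datetime moves the wall clock, not elapsed time.
    next_midnight = start + timedelta(days=1)
    # Convert both ends to UTC so the subtraction measures real elapsed time.
    elapsed = next_midnight.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return elapsed.total_seconds() / 3600


print(day_length_hours(2014, 3, 9))   # day on which clocks spring forward
print(day_length_hours(2014, 11, 2))  # day on which clocks fall back
print(day_length_hours(2014, 6, 1))   # an ordinary day
```

Code that assumes every day has 24 hours, for example when bucketing log lines into hourly files, would miscount on the first two dates. Timestamps recorded in UTC avoid the issue because UTC has no daylight saving transitions.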

16.7 Summary

Monitoring is the primary way we gain visibility into the systems we run. It includes
real-time monitoring, which is used to alert us to exceptional situations that need attention,
and long-term or historic data collection, which facilitates trend analysis. Distributed systems
are complex and require extensive monitoring. No one person can watch over the system
unaided or be expected to intuit what is going on.

The goal of monitoring is to detect problems before they turn into outages, not to detect
outages. If we simply detect outages, then our operating process has downtime “baked in.”
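
One way to act on a problem before it becomes an outage is to project a trend from historic data. The following is a minimal sketch, not a recommended alerting policy: it fits a least-squares line to hypothetical disk-usage samples and estimates how many hours remain before the disk is full. The function name hours_until_full, the sample values, and the 24-hour threshold are all assumptions made for the example.

```python
def hours_until_full(samples, capacity=100.0):
    """Project when usage reaches capacity from (hour, percent_used) samples.

    Returns the hours remaining after the last sample, or None if usage
    is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    full_at = (capacity - intercept) / slope
    return full_at - samples[-1][0]


# Hypothetical hourly samples: usage grows two percentage points per hour.
samples = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]
remaining = hours_until_full(samples)
if remaining is not None and remaining < 24:
    print(f"warning: disk projected to fill in about {remaining:.0f} hours")
```

An alert raised at this point leaves time to add capacity or clean up, whereas an alert that fires only when the disk is already full merely reports the outage.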

A measurement is a data point. It refers to a single point of data describing an aspect of
a system, usually a numerical value or a string. A metric is a measurement with a name and
a timestamp.
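
The distinction between a measurement and a metric maps naturally onto a small data type. The sketch below is one possible representation, assuming Python dataclasses; the class name Metric, the helper record, and the metric name disk.free_bytes are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Union

# A measurement is a bare value: usually a number or a string.
Measurement = Union[int, float, str]


@dataclass(frozen=True)
class Metric:
    """A measurement together with a name and a timestamp."""
    name: str
    value: Measurement
    timestamp: datetime


def record(name: str, value: Measurement, now: Optional[datetime] = None) -> Metric:
    """Wrap a measurement as a metric, stamping it in UTC unless a time is given."""
    when = now if now is not None else datetime.now(timezone.utc)
    return Metric(name=name, value=value, timestamp=when)


sample = record("disk.free_bytes", 1024)
print(sample.name, sample.value, sample.timestamp.isoformat())
```

Stamping in UTC by default keeps metrics from different machines directly comparable.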

Deciding what to monitor should begin with a top-down process. Identify the business's
key performance indicators (KPIs) and then determine which metrics can be collected to
create those KPIs.
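
As an illustration of working top-down, suppose the KPI is the percentage of requests served successfully. Working backward, two counters are enough to compute it. The KPI choice, the function name availability_kpi, and the metric names below are assumptions made for this sketch.

```python
from typing import Mapping


def availability_kpi(metrics: Mapping[str, float]) -> float:
    """Compute the percentage of successful requests from two collected counters."""
    total = metrics["http.requests.total"]
    errors = metrics["http.requests.errors"]
    if total == 0:
        # With no traffic the ratio is undefined, so refuse to report a number.
        raise ValueError("no requests observed; KPI is undefined")
    return 100.0 * (total - errors) / total


# Hypothetical counter values collected over one reporting period.
collected = {"http.requests.total": 2000, "http.requests.errors": 5}
print(f"availability: {availability_kpi(collected):.2f}%")
```

Starting from the KPI makes it clear which counters must be instrumented, rather than collecting whatever happens to be available and hoping a useful indicator emerges.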

Monitoring is particularly important for distributed systems. By instrumenting systems
and servers and automatically collecting the exposed metrics, we can become the omniscient,
omnipresent, omnipotent system administrators that stakeholders assume we are.
