This is important because often logs from different systems are compared when debugging
or figuring out what went wrong after an outage. For example, when writing a postmortem,
one builds a timeline of events by collecting logs from various machines and services,
including chat room transcripts, instant message sessions, and email discussions. If
each of these services records timestamps in its local time zone, just figuring out the order
of what happened can be a challenge, especially if such systems do not indicate which time
zone was being used.
Consolidating logs on a large scale is generally automated and the consolidation process
normalizes all logs to UTC. Unfortunately, configuration mistakes can result in logging
data being normalized incorrectly, something that is not noticed until it is too late. This
problem can be avoided by using UTC for everything.
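
The normalization step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production log pipeline: the log entries, the source names, and the `build_timeline` helper are all hypothetical, and fixed UTC offsets stand in for real time zones to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets used purely for illustration (no daylight saving handling).
PACIFIC = timezone(timedelta(hours=-8))
CENTRAL_EUROPE = timezone(timedelta(hours=1))

# Hypothetical log entries: (source, timestamp with explicit offset, message).
entries = [
    ("web-server", datetime(2021, 1, 15, 9, 5, tzinfo=CENTRAL_EUROPE), "error rate rising"),
    ("chat", datetime(2021, 1, 15, 0, 2, tzinfo=PACIFIC), "anyone else seeing errors?"),
    ("database", datetime(2021, 1, 15, 8, 0, tzinfo=timezone.utc), "replica lag detected"),
]


def build_timeline(entries):
    """Convert every timestamp to UTC, then sort chronologically."""
    normalized = [
        (ts.astimezone(timezone.utc), source, message)
        for source, ts, message in entries
    ]
    return sorted(normalized)


timeline = build_timeline(entries)
for ts, source, message in timeline:
    print(ts.isoformat(), source, message)
```

Sorting by the local wall-clock values would place the chat message first, while the UTC timeline shows that the database warning actually came first. The conversion is only possible because each entry carries its offset; a timestamp recorded without one cannot be normalized reliably.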
Google famously timestamps logs using the U.S./Pacific time zone, which caused no
end of frustration for Tom when he worked there. This time zone has a different daylight
saving time calendar than Europe, making log normalization extra complex two weeks
each year, depending on the year. It also means that software must be written to understand
that one day each year is missing an hour, and another day each year has an extra hour.
Legend has it that the U.S./Pacific time zone is used simply because the first Google systems
administrator did not consider the ramifications of his decision. The time zone is embedded
so deeply in Google's many systems that there is little hope it will ever change.
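
The missing and extra hours can be seen directly with Python's standard `zoneinfo` module. The sketch below assumes Python 3.9 or later and a system with the IANA time zone database installed; `day_length_hours` is a hypothetical helper, and `America/Los_Angeles` is used here as the IANA name for the U.S. Pacific zone.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def day_length_hours(day, tz_name="America/Los_Angeles"):
    """Return how many hours elapse between local midnight and the next local midnight."""
    tz = ZoneInfo(tz_name)
    start = datetime.combine(day, time.min, tzinfo=tz)
    end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=tz)
    # Convert to UTC before subtracting. Subtracting two datetimes that share
    # the same tzinfo object compares wall-clock values and always gives 24 hours.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return elapsed / timedelta(hours=1)


print(day_length_hours(date(2021, 3, 14)))   # day daylight saving time began in 2021
print(day_length_hours(date(2021, 11, 7)))   # day daylight saving time ended in 2021
print(day_length_hours(date(2021, 6, 1)))    # an ordinary day
```

Any code that assumes every day has 24 hours, such as daily rollups or "same time yesterday" comparisons, has to handle these two days specially when logs are kept in a zone that observes daylight saving time. Logs kept in UTC have no such days.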
16.7 Summary
Monitoring is the primary way we gain visibility into the systems we run. It includes
real-time monitoring, which is used to alert us to exceptional situations that need
attention, and long-term or historic data collection, which facilitates trend analysis.
Distributed systems
are complex and require extensive monitoring. No one person can watch over the system
unaided or be expected to intuit what is going on.
The goal of monitoring is to detect problems before they turn into outages, not to detect
outages. If we simply detect outages, then our operating process has downtime “baked in.”
A measurement is a single data point describing an aspect of a system, usually a
numerical value or a string. A metric is a measurement with a name and a timestamp.
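
The distinction between a measurement and a metric can be expressed as a small data structure. This is an illustrative sketch only: the `Metric` class, the `record` helper, and the metric name `disk.used_percent` are hypothetical and do not correspond to any particular monitoring system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Union

# A measurement is a bare value: usually a number or a string.
Measurement = Union[int, float, str]


@dataclass(frozen=True)
class Metric:
    """A measurement together with a name and a timestamp."""
    name: str
    value: Measurement
    timestamp: datetime


def record(name, value):
    """Wrap a measurement as a metric, stamped with the current time in UTC."""
    return Metric(name=name, value=value, timestamp=datetime.now(timezone.utc))


m = record("disk.used_percent", 73.5)
print(m)
```

Stamping the metric in UTC at collection time follows the same reasoning as for logs: values gathered from many machines can then be compared without any time zone conversion.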
Deciding what to monitor should begin with a top-down process. Identify the business's
key performance indicators (KPIs) and then determine which metrics can be collected to
create those KPIs.
Monitoring is particularly important for distributed systems. By instrumenting systems
and servers and automatically collecting the exposed metrics, we can become the omniscient,
omnipresent, omnipotent system administrators that stakeholders assume we are.