Chapter 16. Monitoring Fundamentals
You can observe a lot by just watching.
—Yogi Berra
Monitoring is the primary way we gain visibility into the systems we run. It is the process
of observing information about the state of things for use in both short-term and long-term
decision making. The operational goal of monitoring is to detect the precursors of outages
so they can be fixed before they become actual outages, to collect information that aids
decision making in the future, and to detect actual outages. Monitoring is difficult.
Organizations often monitor the wrong things and sometimes do not monitor the important things.
The ideal monitoring system makes the operations team omniscient and omnipresent.
Considering that having the root password makes us omnipotent, we are quite the omniarchs.
Distributed systems are complex. Being omniscient, all knowing, means our monitoring
system should give us the visibility into the system to find out anything we need to know to
do our job. We may not know everything the monitoring system knows, but we can look it
up when we need it. Distributed systems are too large for any one person to know everything
that is happening.
The large size of distributed systems means we must be omnipresent, existing
everywhere at the same time. Monitoring systems permit us to do this even when our systems are
distributed around the world. In a traditional system one could imagine a system
administrator who knows enough about the system to keep an eye on all the critical components.
Whether or not this perception is accurate, we know that in distributed systems it is
definitely not true.
Monitoring in distributed computing is different from monitoring in enterprise
computing. Monitoring is not just a system that wakes you up at night when a service or site is
down. Ideally, that should never happen. Choosing a strategy that involves reacting to
outages means that we have selected an operational strategy with outages “baked in.” We can
improve how fast we respond to an outage, but the outage still happened. That's no way to
run a reliable system.
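
The difference between detecting a precursor and reacting to an outage can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the chapter: the disk-usage metric, the names Sample, hours_until_full, and classify, and the 24-hour warning window are all assumptions chosen for the example. It draws a straight line through the oldest and newest readings and reports a precursor while there is still time to act.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical disk-usage reading."""
    hour: float      # time of the measurement, in hours
    used_pct: float  # disk usage at that time, 0-100


def hours_until_full(samples):
    """Estimate hours until usage reaches 100%.

    Assumes a non-empty list ordered by time. Uses only the first and
    last samples (a deliberately crude linear extrapolation). Returns
    None when usage is flat or shrinking, or when no time has elapsed.
    """
    first, last = samples[0], samples[-1]
    elapsed = last.hour - first.hour
    if elapsed <= 0:
        return None
    rate = (last.used_pct - first.used_pct) / elapsed  # percent per hour
    if rate <= 0:
        return None
    return (100.0 - last.used_pct) / rate


def classify(samples, warn_hours=24.0):
    """Label the disk as 'outage', 'precursor', or 'ok'.

    warn_hours is an assumed lead time: how far ahead of a full disk
    we want to be told about it.
    """
    if samples[-1].used_pct >= 100.0:
        return "outage"  # too late: the disk is already full
    remaining = hours_until_full(samples)
    if remaining is not None and remaining <= warn_hours:
        return "precursor"  # still working, but trending toward failure
    return "ok"
```

With readings of 70% at hour 0 and 80% at hour 10, the disk is filling at 1% per hour, so the sketch estimates 20 hours until it is full and labels the state a precursor rather than waiting for the outage. A production system would likely estimate the trend from more than two points.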