System Monitoring
System-monitoring tools perform health checks based on general system statistics. Their purpose is to recognize
irregular load patterns and failures as they occur. Even though these tools can monitor the whole infrastructure
at once, it is important to emphasize that they monitor only individual components (for example, hosts, application
servers, databases, or storage subsystems) without considering the interplay between them. As a result, it is difficult,
and for complex infrastructures virtually impossible, to determine the impact on the system response time when a
single component of the supporting infrastructure experiences an anomaly, such as high usage of a particular
resource. In other words, an alert coming from a system-monitoring tool is just a warning that something
could be wrong with the application or the infrastructure, but the users may not experience any performance
problems at all (called a false positive). In contrast, there may be situations where users are experiencing performance
problems, but the system-monitoring tool does not recognize them (called a false negative). The most common and
simplest cases of false positives and false negatives are seen while monitoring the CPU load of SMP systems with a lot
of CPUs. Let's say you have a system with four quad-core CPUs. Whenever you see a utilization of about 75%, you
may think that it is too high and that the system is CPU-bound. However, this load could be very healthy if the number
of running tasks is much greater than the number of cores. This is a false positive. Conversely, whenever you see a
utilization of about 8% of the CPU, you may think that everything is fine. But if the system is running a single task that
is not parallelized, it is possible that the bottleneck for this task is the CPU. In fact, with 16 cores, 1/16th of 100% is
only 6.25%, and therefore a single task cannot burn more than 6.25% of the available CPU. This is a false negative.
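The arithmetic behind the false-negative example can be sketched in a few lines of Python. This is only an illustration: the function names, and the idea of comparing an observed utilization against the single-core ceiling, are assumptions made here and do not come from any particular monitoring tool.

```python
def single_task_ceiling(total_cores: int) -> float:
    """Largest system-wide CPU utilization (in percent) that one
    non-parallelized task can cause on a machine with total_cores cores."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return 100.0 / total_cores


def may_hide_saturated_core(observed_pct: float, total_cores: int) -> bool:
    """Hypothetical heuristic: True when the observed utilization is high
    enough that a single task could be using one core completely."""
    return observed_pct >= single_task_ceiling(total_cores)


# Four quad-core CPUs, as in the example above.
cores = 4 * 4
print(single_task_ceiling(cores))           # 6.25
print(may_hide_saturated_core(8.0, cores))  # True
print(may_hide_saturated_core(3.0, cores))  # False
```

A True result does not prove that a task is CPU-bound. It only shows that a low overall utilization figure cannot rule it out.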
Response-Time Monitoring
Response-time monitoring tools (also known as application-monitoring tools) perform health checks based either on
synthetic transactions that are processed by robots or on real transactions that are processed by end users. The tools
measure the time taken by an application to process key transactions, and if the time exceeds an expected threshold
value, they raise an alert. In other words, they exploit the infrastructure as users do, and they complain about poor
performance as users do. Because they probe the application from a user perspective, they can check not only
single components but, more importantly, the whole infrastructure supporting the application. For this reason, they
are well suited to monitoring service-level agreements.
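The core of such a check, timing a key transaction and comparing the result with a threshold, might look like the following minimal Python sketch. The function `check_response_time`, its parameters, and the stand-in transactions are hypothetical names chosen for illustration. A real tool would run an actual application request and would typically sample repeatedly.

```python
import time
from typing import Callable, Tuple


def check_response_time(transaction: Callable[[], object],
                        threshold_seconds: float) -> Tuple[float, bool]:
    """Run one synthetic transaction and report (elapsed_seconds, alert).

    alert is True when the elapsed time exceeds the expected threshold.
    """
    start = time.perf_counter()
    transaction()  # the robot executes the key transaction here
    elapsed = time.perf_counter() - start
    return elapsed, elapsed > threshold_seconds


def fast_transaction() -> None:
    """Stand-in for a transaction that completes almost immediately."""
    pass


def slow_transaction() -> None:
    """Stand-in for a transaction that takes about 50 ms."""
    time.sleep(0.05)


for name, tx, limit in [("fast", fast_transaction, 1.0),
                        ("slow", slow_transaction, 0.01)]:
    elapsed, alert = check_response_time(tx, limit)
    status = "ALERT" if alert else "ok"
    print(f"{name}: {elapsed:.3f}s (threshold {limit}s) {status}")
```

Because the measurement wraps the whole transaction, it reflects whatever the user would experience, regardless of which component caused the delay.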
Compulsive Tuning Disorder
Once upon a time, most database administrators suffered from a disease called compulsive tuning disorder.³ The signs
of this illness were the excessive checking of many performance-related statistics, most of them ratio-based, and
the inability to focus on what was really important. They simply thought that by applying some “simple” rules, it was
possible to tune their databases. History teaches us that results were not always as good as expected. Why was this the
case? Well, all the rules used to check whether a given ratio (or value) was acceptable were defined independently of
the user experience. In other words, false negatives or positives were the rule and not the exception. Even worse, an
enormous amount of time was spent on these tasks.
For example, from time to time a database administrator will ask me a question like “On one of our databases I
noticed that we have a large amount of waits on latch X. What can I do to reduce or, even better, get rid of such waits?”
My typical answer is “Do your users complain because they are waiting on this specific latch? Of course not. So, do not
worry about it. Instead, ask them what problems they are facing with the application. Then, by analyzing those problems,
you will find out whether the waits on latch X are related to them or not.” I elaborate on this in the next section.
Even though I have never worked as a database administrator, I must admit I suffered from compulsive tuning
disorder as well. Today, I have, like most other people, gotten over this disease. Unfortunately, as with any bad illness,
it takes a very long time to completely vanish. Some people are simply not aware of being infected. Others are aware,
but after many years of addiction, it is always difficult to recognize such a big mistake and break the habit.