easily set up an alert for when they start to take too long. Cassandra is unavailable
during a stop-the-world garbage collection pause. The longer these pauses take,
the longer Cassandra will be unavailable.
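As a rough sketch of what such a check might poll, the following Java snippet connects to a node's JMX port and reads the standard garbage collector MXBeans; the host name, the port, and the idea of diffing the counters between polls to compute pause time per interval are illustrative assumptions, not anything specific to Cassandra.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Connect to a Cassandra node's JMX port (7199 is the usual default) and
// report cumulative GC counts and total pause time. Host and port are
// placeholders for illustration.
public class GcPauseCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://cassandra-node-1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            List<GarbageCollectorMXBean> gcBeans =
                ManagementFactory.getPlatformMXBeans(mbsc, GarbageCollectorMXBean.class);
            for (GarbageCollectorMXBean gc : gcBeans) {
                long count = gc.getCollectionCount();
                long totalMillis = gc.getCollectionTime();
                // A monitoring system would diff these counters between polls
                // and alert when the average pause grows past a threshold.
                System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), count, totalMillis);
            }
        }
    }
}
```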
Another metric that is useful in helping to determine whether or not you need
to add capacity to your cluster is PendingTasks under the CompactionManagerMBean.
Depending on the speed and volume with which you ingest data, you will
need to find a comfortable set of thresholds for your system. Typically, the number
of PendingTasks should be relatively low, as in fewer than 50 at any given time.
There are certainly acceptable reasons for things to back up, such as forced
compactions or cleanup, but it is advisable to watch this metric carefully. If you have
an alert set for PendingTasks and find this alert firing regularly, you may need to
add more capacity (either more or faster disks or more nodes) to your cluster to
keep up with the workload.
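A minimal sketch of such a check follows. It assumes the compaction manager is exposed as org.apache.cassandra.db:type=CompactionManager with a PendingTasks attribute, which matches common Cassandra releases but should be verified in JConsole for your version; the host, port, and threshold of 50 simply mirror the rule of thumb above. Run periodically from cron or a monitoring agent, it would give you the alert described in the text.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Read the compaction manager's pending-task count over JMX. The ObjectName
// and attribute name may differ between Cassandra versions; confirm them in
// JConsole before relying on this check.
public class PendingCompactionsCheck {
    private static final int PENDING_TASKS_THRESHOLD = 50; // rule of thumb from the text

    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://cassandra-node-1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ObjectName compactionManager =
                new ObjectName("org.apache.cassandra.db:type=CompactionManager");
            Number pending = (Number) mbsc.getAttribute(compactionManager, "PendingTasks");
            if (pending.intValue() > PENDING_TASKS_THRESHOLD) {
                System.out.println("ALERT: " + pending + " pending compaction tasks");
            } else {
                System.out.println("OK: " + pending + " pending compaction tasks");
            }
        }
    }
}
```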
The last JMX metrics that should make it onto your first round of monitoring
are the amounts of on-heap and off-heap memory in use at any given time. The
amount of on-heap memory used should always be less than the amount of heap
that you have allowed the JVM to allocate. Since you know what this value is at
start time, you should be able to easily monitor whether or not you are approaching
that value. Off-heap memory is a little harder to monitor for sane values.
This is a metric where you will once again have to take a look at JConsole and see
what the regular and peak values are for the system under normal and peak operational
loads so you don't send off useless alerts.
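The sketch below reads the JVM's heap and non-heap usage from the standard java.lang:type=Memory MXBean as one proxy for these values; Cassandra also publishes its own off-heap accounting through its MBeans, and the 85 percent warning ratio here is only a placeholder for whatever baseline you establish in JConsole.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sample on-heap and non-heap usage from the standard Memory MXBean.
// The warning ratio is an arbitrary example; derive real thresholds from
// the baselines you observe under normal and peak load.
public class MemoryCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://cassandra-node-1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            MemoryMXBean memory =
                ManagementFactory.getPlatformMXBean(mbsc, MemoryMXBean.class);
            MemoryUsage heap = memory.getHeapMemoryUsage();
            MemoryUsage offHeap = memory.getNonHeapMemoryUsage();
            double heapRatio = (double) heap.getUsed() / heap.getMax();
            System.out.printf("heap used: %d of %d bytes (%.0f%%)%n",
                heap.getUsed(), heap.getMax(), heapRatio * 100);
            System.out.printf("off-heap (non-heap) used: %d bytes%n", offHeap.getUsed());
            if (heapRatio > 0.85) {  // example threshold only
                System.out.println("WARNING: approaching the configured heap limit");
            }
        }
    }
}
```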
Log Monitoring
There is a lot of useful information in the Cassandra logs that can be indicative of
a problem. As mentioned earlier in the chapter, you can find READ and WRITE
dropped message counts within the INFO log level. There is a Nagios plug-in
that can monitor logs and check for specific log messages. Using this plug-in, you
can have Nagios alert you not just when READ and/or WRITE messages are
dropped, but also when this happens more than n times per period. For instance,
your application may be tolerant of missing READs and much less tolerant of
missing WRITEs. So the log monitoring check can alert you
with a CRITICAL alert if more than 1,000 mutations have been dropped over a
five-minute period and with a WARNING alert if more than 1,000 mutations have
been dropped over a 15-minute period.
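To make the mechanics concrete, here is a stripped-down stand-in for such a check, written as a small Java program that counts dropped-message lines and maps the count to the conventional Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL). The log path, the matched phrase, and the thresholds are illustrative assumptions, and a real plug-in would restrict the count to the last 5 or 15 minutes rather than scanning the whole file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Count dropped-mutation lines in the Cassandra system log and exit with a
// Nagios-style status code. Path, matched phrase, and thresholds are
// placeholders; a production check would bound the count by timestamp.
public class DroppedMutationsCheck {
    public static void main(String[] args) throws IOException {
        Path log = Paths.get("/var/log/cassandra/system.log");
        long dropped;
        try (Stream<String> lines = Files.lines(log)) {
            dropped = lines.filter(l -> l.contains("MUTATION messages dropped")).count();
        }
        if (dropped > 1000) {
            System.out.println("CRITICAL - " + dropped + " dropped mutations");
            System.exit(2);
        } else if (dropped > 0) {
            System.out.println("WARNING - " + dropped + " dropped mutations");
            System.exit(1);
        }
        System.out.println("OK - no dropped mutations");
        System.exit(0);
    }
}
```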