Monitoring - Practical Cassandra

Ports

There are three primary ports of interest to Cassandra: 7000 (or 7001 if SSL/TLS is enabled), 7199, and 9160. Port 7000/7001 is used by Cassandra for cluster communication. This includes things such as the Gossip protocol and failure detection. Port 7199 is used by JMX. Port 9160 is the Thrift port and is used for client communication. In order for your cluster to function properly, all of these ports should be accessible.

While it is not necessary to specifically monitor these ports, it is a good idea to test them one way or another. Testing the Thrift port (9160) amounts to testing whether you can connect to an instance using a Cassandra driver. In terms of monitoring, if you can connect, the check passes; if you can't connect to the server, the check should send an alert. You can also use a simple TCP check here, even though it is less comprehensive.
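
The simple TCP check mentioned above could look roughly like the following Python sketch. The helper names port_is_open and check_ports are illustrative and not part of Cassandra or any monitoring tool, and the port map assumes the default ports listed in this section. As noted, a successful TCP connection only shows that something is listening; it does not prove that a driver can actually run queries.

```python
import socket

# Default ports named in this section; adjust to match your configuration.
CASSANDRA_PORTS = {
    "cluster": 7000,  # 7001 if SSL/TLS is enabled
    "jmx": 7199,
    "thrift": 9160,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def check_ports(host, ports=CASSANDRA_PORTS, timeout=2.0):
    """Map each port label to True (reachable) or False (unreachable)."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}
```

A monitoring job might call check_ports("10.0.0.5") (a placeholder address) and raise an alert for any label that maps to False.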

JMX Checks

Using some of the knowledge we gained from looking at the normal behavior of our system with JConsole, we are going to add some checks using JMX. There are plug-ins for Nagios that enable you to run JMX queries and compare the results against a set of predetermined thresholds. While there are many values that can be monitored through JMX, there are a few that stand out.

The first set of JMX checks to create is for read and write request latency. These values are given in microseconds because they should be that small. These latencies can be measured at the Cassandra application level and/or at the ColumnFamily level. Measuring them at the application level is important as a general health metric. High request latencies can indicate a bad disk or a read pattern that is starting to slow down. If there is a ColumnFamily for which extremely low-latency reads and/or writes are particularly important, it is a good idea to monitor the performance of that ColumnFamily as well. Note that read latency and write latency are two separate metrics provided by Cassandra, and both are important in their own right depending on your workload.
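
To make the threshold comparison concrete, here is a minimal Python sketch of what a Nagios-style check does with a latency reading once it has been fetched over JMX. The function name check_latency and the threshold values are hypothetical; sensible thresholds depend on your hardware and workload. The sketch also assumes the value has already been read and is expressed in microseconds. The return codes follow the usual Nagios plug-in convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
# Conventional Nagios plug-in exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_latency(name, latency_us, warn_us, crit_us):
    """Compare one latency reading (in microseconds) against two thresholds.

    Returns a (status_code, message) pair in the style of a Nagios plug-in.
    """
    if latency_us >= crit_us:
        return CRITICAL, f"CRITICAL - {name} latency {latency_us:.0f}us >= {crit_us}us"
    if latency_us >= warn_us:
        return WARNING, f"WARNING - {name} latency {latency_us:.0f}us >= {warn_us}us"
    return OK, f"OK - {name} latency {latency_us:.0f}us"


# Read and write latency are separate metrics, so each gets its own check.
# The readings and thresholds below are made-up illustration values.
read_status, read_msg = check_latency("read", 850.0, warn_us=1000, crit_us=5000)
write_status, write_msg = check_latency("write", 1200.0, warn_us=1000, crit_us=5000)
```

The same function can be reused for a single ColumnFamily by passing that ColumnFamily's reading along with tighter thresholds.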

The next set of JMX metrics to keep tabs on is garbage collection timing. Cassandra will tell you not only how long its last garbage collection took but also how long the last ParNew GC took. A good way to think of ParNew garbage collection is that it is a stop-the-world garbage collection that uses multiple GC threads to complete its job. If you are monitoring the amount of time these take, you can
