Troubleshooting
We have learned about cluster configuration, repair and scaling, and, finally, monitoring. The purpose of all this learning is for you to keep production environments up and running smoothly. You may choose the right ingredients to set up a cluster that fits your needs, but over time there may be node failures, high CPU usage, high memory usage, disk space issues, network failures, and, probably, performance issues. You will get most of this information from the monitoring tool that you have configured, and you will need to take the necessary action depending on the problems that you are facing.
Usually, one goes about finding these issues via the various tools that we have discussed in previous chapters. You may want to extend the list of investigation tools to include standard Linux tooling: netstat and tcpdump for network debugging; vmstat, free, top, and dstat for memory statistics; perf, top, dstat, and uptime for CPU statistics; and iostat, iotop, and df for disk usage.
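When an alert fires, it often helps to capture a snapshot of several of these statistics at once on the suspect node. The following is a minimal sketch of such a wrapper script; the command list and the output file name are illustrative and assume the tools are installed on the node.

#!/usr/bin/env python
# collect_diagnostics.py -- snapshot basic OS statistics on a suspect node.
# The commands and the output path are illustrative; adjust them to your environment.
import datetime
import subprocess

COMMANDS = [
    ["uptime"],                  # load averages
    ["free", "-m"],              # memory usage in MB
    ["vmstat", "1", "5"],        # memory/CPU samples over five seconds
    ["iostat", "-x", "1", "5"],  # per-device disk utilization
    ["df", "-h"],                # file system usage
    ["netstat", "-s"],           # protocol-level network counters
]

def main():
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    with open("diagnostics-%s.txt" % stamp, "w") as out:
        for cmd in COMMANDS:
            out.write("===== %s =====\n" % " ".join(cmd))
            try:
                out.write(subprocess.check_output(cmd).decode())
            except (OSError, subprocess.CalledProcessError) as exc:
                out.write("failed: %s\n" % exc)
            out.write("\n")

if __name__ == "__main__":
    main()

Running this on the affected node gives you all the numbers from the same moment, which makes it much easier to correlate, say, a disk utilization spike with a jump in load average.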
How do you actually know there is a problem? With a decent monitoring setup and a vigilant system administrator, problems usually come to one's knowledge via alerts sent by the monitoring system. It may be a mail from OpsCenter, a critical message from Nagios, or a message from your home-grown JMX-based monitoring system. Another way the issues show up is as performance degradation at a certain load: you may find that your application is acting weird or abnormally slow, and when you dig into the error you find that the Cassandra calls are taking much longer than expected. The other, and scarier, way the problems come to one's notice is in production. Things have been working decently in the test environment, and you suddenly start seeing frequent garbage collection calls, or the production servers start to scream, "Too many open files."
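When "Too many open files" shows up, the first thing to check is how close the Cassandra process is to its file descriptor limit. The following is a minimal sketch that reads this from /proc; it assumes a Linux system, that you run it as root or as the same user as the process, and that you supply the PID yourself (for example, via pgrep -f CassandraDaemon).

#!/usr/bin/env python
# fd_usage.py -- compare a process's open file descriptors with its soft limit.
# Assumes Linux /proc; pass the PID of the Cassandra process as the only argument.
import os
import sys

def fd_usage(pid):
    open_fds = len(os.listdir("/proc/%d/fd" % pid))   # one entry per open descriptor
    soft_limit = None
    with open("/proc/%d/limits" % pid) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                soft_limit = int(line.split()[3])      # fourth field is the soft limit
                break
    return open_fds, soft_limit

if __name__ == "__main__":
    pid = int(sys.argv[1])
    used, limit = fd_usage(pid)
    print("PID %d: %d of %s file descriptors in use" % (pid, used, limit))

If the count is close to the limit, raise the limit (for example, in /etc/security/limits.conf or the equivalent for your init system) and investigate why so many files or sockets are being held open.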
In many of the error scenarios, the solution is a simple one. For cases such as AWS notifying you of an instance shutdown due to underlying hardware degradation, the fix is to replace the node with a new one. For a disk full issue, you may either add a new node or just add more hard disks and add the new location to the data directory setting in cassandra.yaml.
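The data directory setting accepts a list of locations, so an additional disk can simply be mounted and listed alongside the existing one. The paths below are placeholders; use the mount points that actually exist on your nodes.

data_file_directories:
    - /var/lib/cassandra/data
    - /mnt/disk2/cassandra/data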
The following are a few troubleshooting tips; most of them you might already know from previous chapters.