Troubleshooting
We have learned about cluster configuration, repair and scaling, and, finally, monitoring. The purpose of all this learning is for you to keep production environments up and running smoothly. You may choose the right ingredients to set up a cluster that fits your needs, but over time there may be node failures, high CPU usage, high memory usage, disk space issues, network failures, and, probably, performance issues. You will get most of this information from the monitoring tool that you have configured, and you will need to take the necessary action depending on the problems that you are facing.
Usually, one goes about finding these issues via the various tools that we have discussed in previous chapters. You may want to extend the list of investigation tools to include standard Linux tooling: netstat and tcpdump for network debugging; vmstat, free, top, and dstat for memory statistics; perf, top, dstat, and uptime for CPU statistics; and iostat, iotop, and df for disk usage.
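When an alert fires, it often helps to capture a snapshot of several of these statistics at once on the suspect node. The following is a minimal sketch of such a wrapper script; the command list and the output file name are illustrative and assume the tools are installed on the node.

#!/usr/bin/env python
# collect_diagnostics.py -- snapshot basic OS statistics on a suspect node.
# The commands and the output path are illustrative; adjust them to your environment.
import datetime
import subprocess

COMMANDS = [
    ["uptime"],                  # load averages
    ["free", "-m"],              # memory usage in MB
    ["vmstat", "1", "5"],        # memory/CPU samples over five seconds
    ["iostat", "-x", "1", "5"],  # per-device disk utilization
    ["df", "-h"],                # file system usage
    ["netstat", "-s"],           # protocol-level network counters
]

def main():
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    with open("diagnostics-%s.txt" % stamp, "w") as out:
        for cmd in COMMANDS:
            out.write("===== %s =====\n" % " ".join(cmd))
            try:
                out.write(subprocess.check_output(cmd).decode())
            except (OSError, subprocess.CalledProcessError) as exc:
                out.write("failed: %s\n" % exc)
            out.write("\n")

if __name__ == "__main__":
    main()

Running this on the affected node gives you all the numbers from the same moment, which makes it much easier to correlate, say, a disk utilization spike with a jump in load average.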
How do you actually know there is a problem? With a decent monitoring setup and a vigilant system administrator, problems usually come to one's knowledge via alerts sent by the monitoring system. It may be a mail from OpsCenter, a critical message from Nagios, or a message from your home-grown JMX-based monitoring system. Another way the issues show up is as performance degradation at a certain load: you may find that your application is acting weird or abnormally slow, and when you dig into the error you find that the Cassandra calls are taking much longer than expected. The other, and scarier, way the problems come to one's notice is in production. Things have been working decently in the test environment, and you suddenly start seeing frequent garbage collection calls, or the production servers start to scream, "Too many open files."
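When "Too many open files" shows up, the first thing to check is how close the Cassandra process is to its file descriptor limit. The following is a minimal sketch that reads this from /proc; it assumes a Linux system, that you run it as root or as the same user as the process, and that you supply the PID yourself (for example, via pgrep -f CassandraDaemon).

#!/usr/bin/env python
# fd_usage.py -- compare a process's open file descriptors with its soft limit.
# Assumes Linux /proc; pass the PID of the Cassandra process as the only argument.
import os
import sys

def fd_usage(pid):
    open_fds = len(os.listdir("/proc/%d/fd" % pid))   # one entry per open descriptor
    soft_limit = None
    with open("/proc/%d/limits" % pid) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                soft_limit = int(line.split()[3])      # fourth field is the soft limit
                break
    return open_fds, soft_limit

if __name__ == "__main__":
    pid = int(sys.argv[1])
    used, limit = fd_usage(pid)
    print("PID %d: %d of %s file descriptors in use" % (pid, used, limit))

If the count is close to the limit, raise the limit (for example, in /etc/security/limits.conf or the equivalent for your init system) and investigate why so many files or sockets are being held open.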
In many of the error scenarios, the solution is a simple one. For cases such as AWS notifying you of an instance shutdown due to underlying hardware degradation, the fix is to replace the node with a new one. For a disk full issue, you may either add a new node or just add more hard disks and add the new location to the data directory setting in cassandra.yaml.
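The data directory setting accepts a list of locations, so an additional disk can simply be mounted and listed alongside the existing one. The paths below are placeholders; use the mount points that actually exist on your nodes.

data_file_directories:
    - /var/lib/cassandra/data
    - /mnt/disk2/cassandra/data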
The following are a few troubleshooting tips; most of them you might already know from previous chapters.