Failure scenario handling - detection and recovery
The gossip protocol is Cassandra's efficient way of detecting that a failure has
occurred; the entire ring learns about a downed host through gossip. Conversely,
when a node joins the cluster, the same mechanism is employed to inform all the
nodes in the ring.
Once Cassandra detects the failure of a node on the ring, it stops routing client requests
to it. A failure certainly has some impact on the overall performance of the cluster.
However, it is never a blocker as long as there are enough live replicas to serve the
client at the requested consistency level.
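To make "enough replicas" concrete, here is a small illustrative sketch (the replication factor of 3 is an assumption, not from the text): a QUORUM read or write needs a majority of the replicas, that is, floor(RF / 2) + 1 of them, so with RF = 3 a replica set can tolerate one downed node and still serve clients.

```shell
# Illustrative quorum arithmetic for an assumed replication factor of 3.
# QUORUM = floor(RF / 2) + 1; shell integer division gives 3 / 2 = 1.
RF=3
QUORUM=$(( RF / 2 + 1 ))
echo "RF=$RF requires $QUORUM live replicas for QUORUM"
```

With RF = 3 this prints that 2 live replicas are required, which is why the loss of a single node is a performance concern but not an availability blocker.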
Another interesting fact about gossip is that it happens at various levels: like real-world
gossip, Cassandra's gossip can be secondhand, thirdhand, and so on; this is the
manifestation of indirect gossip.
The failure of a node can be actual or virtual. Either the node actually fails, due to its
hardware giving out, or the failure is virtual, wherein network latency is so high for a
while that the node appears unresponsive. The latter scenario is usually
self-recoverable; that is, once the network returns to normal, the node is detected in the
ring again. The live nodes keep trying to ping and gossip with the failed node
periodically to see whether it is back up. If a node is to be declared permanently
departed from the cluster, admin intervention is required to explicitly remove the node
from the ring.
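That administrative step can be sketched as follows, assuming a running cluster and the stock nodetool utility (the host ID is a placeholder to be copied from the actual status output):

```shell
# Inspect the ring: a permanently failed node is listed with status DN (Down/Normal).
nodetool status

# Remove the departed node from the ring by its Host ID,
# copied from the nodetool status output above.
nodetool removenode <host-id>
```

These commands must be run against a live cluster, so they are shown here as an operational fragment rather than a runnable script.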
When a node rejoins the cluster after quite a while, it is possible that it has missed a
number of writes (inserts/updates/deletes), and thus the data on the node may be far
from the latest state. In such cases, it is advisable to run a repair using the
nodetool repair command.
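For example (the keyspace name below is a placeholder; the -pr flag limits the repair to the node's primary token ranges, a common way to avoid redundant work when repairing every node in turn):

```shell
# Run on the rejoined node to reconcile any writes it missed while down.
nodetool repair

# Or repair only this node's primary token ranges for one keyspace.
nodetool repair -pr my_keyspace
```

Like the removal commands, this is an operational fragment that requires a live cluster to execute.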