Failure scenario handling - detection and recovery
The gossip protocol is Cassandra's efficient way of detecting that a failure has
occurred; the entire ring learns about a downed host through gossip. Conversely,
when a node joins the cluster, the same mechanism is employed to inform all the
nodes in the ring.
Once Cassandra detects the failure of a node on the ring, it stops routing client requests
to it. A failure certainly has some impact on the overall performance of the cluster.
However, it is never a blocker as long as there are enough live replicas to serve the
client at the requested consistency level.
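To make "enough replicas" concrete, here is a small illustrative sketch (the replication factor of 3 is an assumption, not from the text): a QUORUM read or write needs a majority of the replicas, that is, floor(RF / 2) + 1 of them, so with RF = 3 a replica set can tolerate one downed node and still serve clients.

```shell
# Illustrative quorum arithmetic for an assumed replication factor of 3.
# QUORUM = floor(RF / 2) + 1; shell integer division gives 3 / 2 = 1.
RF=3
QUORUM=$(( RF / 2 + 1 ))
echo "RF=$RF requires $QUORUM live replicas for QUORUM"
```

With RF = 3 this prints that 2 live replicas are required, which is why the loss of a single node is a performance concern but not an availability blocker.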
Another interesting fact about gossip is that it happens at various levels: like real-world
gossip, Cassandra's gossip can be secondhand, thirdhand, and so on; this is the
manifestation of indirect gossip.
The failure of a node can be actual or virtual. Either the node actually fails, due to its
hardware giving out, or the failure is virtual, wherein network latency is so high for a
while that the node appears unresponsive. The latter scenario is usually
self-recoverable; that is, once the network returns to normal, the node is detected in the
ring again. The live nodes keep trying to ping and gossip with the failed node
periodically to see whether it is back up. If a node is to be declared permanently
departed from the cluster, admin intervention is required to explicitly remove the node
from the ring.
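That administrative step can be sketched as follows, assuming a running cluster and the stock nodetool utility (the host ID is a placeholder to be copied from the actual status output):

```shell
# Inspect the ring: a permanently failed node is listed with status DN (Down/Normal).
nodetool status

# Remove the departed node from the ring by its Host ID,
# copied from the nodetool status output above.
nodetool removenode <host-id>
```

These commands must be run against a live cluster, so they are shown here as an operational fragment rather than a runnable script.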
When a node rejoins the cluster after quite a while, it is possible that it has missed a
number of writes (inserts/updates/deletes), and thus the data on the node may be far
from the latest state. In such cases, it is advisable to run a repair using the
nodetool repair command.
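For example (the keyspace name below is a placeholder; the -pr flag limits the repair to the node's primary token ranges, a common way to avoid redundant work when repairing every node in turn):

```shell
# Run on the rejoined node to reconcile any writes it missed while down.
nodetool repair

# Or repair only this node's primary token ranges for one keyspace.
nodetool repair -pr my_keyspace
```

Like the removal commands, this is an operational fragment that requires a live cluster to execute.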