Deployment and Provisioning - Practical Cassandra

Database Reference

In-Depth Information

Virtual Nodes

There are many advantages to using vnodes over a setup in which each node has

a single token. It should be noted that you can easily have Cassandra follow the

old single-token paradigm by setting num_tokens to 1 instead of a larger setting

like 256. If you do that, you won't be able to take advantage of more efficient node

bootstrapping. For example, let's say you have a 20-node cluster that has a replic-

ation factor of 3, and you lose a node and have to bring up a replacement. Without

vnodes, you will be guaranteed to touch and likely saturate three nodes in your

cluster (or about 15%). Since the RF is 3 and you are down one node, two replicas

times three ranges (assuming that is the breakdown) means six nodes available to

stream or bootstrap from. That's the theory anyway. In practice, Cassandra will

use only three nodes to rebuild the data. So out of your 20-node cluster, you will

have one node down for a rebuild and three nodes streaming to rebuild it. Effect-

ively 20% of your cluster is degraded. This is all assuming nothing else happens,

such as losing another node while rebuilding the current one.

Enter vnodes. Instead of only being able to stream from other nodes in the rep-

lica set that was lost, the data is spread around the cluster and can be streamed

from many nodes. When it comes to repairs, the way it works is that a validation

compaction is run to see what data needs to be streamed to the node being repaired.

Once the validation compaction is done, the node creates a Merkle tree and sends

the data from that tree to the requesting node. Of the two operations, streaming is

the faster one, whereas the validation can take a long time depending on the size

of the data. With smaller distributions of data, the validations on each node can be

completed more quickly. Since the streaming is fast anyway, the entire process is

sped up because there are more nodes handling less data and providing it to the

requesting node more quickly.

Another huge advantage is the fact that replacing or adding nodes can be done

individually instead of doing full capacity planning. Before virtual nodes, if you

had a 12-node cluster, you could not add just one node without rebalancing. This

would leave the cluster in a state where there is an uneven distribution of data.

Among other things, this can create hot spots, which is bad for the cluster. With

virtual nodes, you can add a single cluster and the shuffle method will redis-

tribute data to ensure that everything is properly balanced throughout the cluster.

If you are using older commodity machines or slightly slower machines, setting

the num_tokens field to something smaller than the default of 256 is probably

the way to go. Starting with the default of 256 is usually fine.

Search WWH ::

Custom Search

Home