How Cassandra Distributes Data - Learning Apache Cassandra

Masterless replication

If you've worked with a relational database in production, it's likely you have experience

with replication. Relational databases typically provide master-follower replication, in

which all data is written to a single master instance; then, behind the scenes, the writes are

replicated to follower instances. The application can read data from any of the followers.

Note that master-follower databases are not distributed: every machine has a full copy of

the dataset. Master-follower replication is great for scaling up the processing power available

for handling read requests, but does nothing to accommodate arbitrarily large datasets.

Master-follower replication also provides some resilience against machine failure: in particular,

failure of a machine will not result in data loss, since other machines have a full

copy of the same dataset.

However, a master-follower architecture cannot guarantee full availability in the case of

hardware failure. In particular, if the master instance fails, the application will be unable to

write any data until the master is restored, or one of the followers is promoted to become

the new master. The process of promoting a new master can be automated using built-in

database features or third-party tools, but there will still be some downtime during which

the application cannot write data.
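
The write outage described above can be illustrated with a minimal sketch. This is a toy in-memory model, not the behavior of any particular relational database; the class MasterFollowerCluster and its methods are hypothetical names chosen for this example, and replication is assumed to be immediate.

```python
class MasterFollowerCluster:
    """Toy model: one master accepts writes and copies them to followers."""

    def __init__(self, follower_count):
        self.master = {}  # the master holds a full copy of the dataset
        self.followers = [{} for _ in range(follower_count)]
        self.master_up = True

    def write(self, key, value):
        # All writes go through the single master instance.
        if not self.master_up:
            raise RuntimeError("master is down: writes are unavailable")
        self.master[key] = value
        for follower in self.followers:  # replication "behind the scenes"
            follower[key] = value

    def read(self, key, follower_index=0):
        # Reads can be served by any follower.
        return self.followers[follower_index].get(key)

    def promote_follower(self, follower_index=0):
        # Failover: a follower becomes the new master so writes can resume.
        self.master = self.followers.pop(follower_index)
        self.master_up = True


cluster = MasterFollowerCluster(follower_count=2)
cluster.write("user:1", "alice")
cluster.master_up = False  # simulate failure of the master

read_after_failure = cluster.read("user:1")  # followers still serve reads
try:
    cluster.write("user:2", "bob")
    write_succeeded = True
except RuntimeError:
    write_succeeded = False  # writes fail until a new master exists

cluster.promote_follower(0)  # promotion ends the write outage
cluster.write("user:2", "bob")
read_after_promotion = cluster.read("user:2")
```

In this sketch, reads keep working throughout, but every write between the failure and the promotion is rejected, which is the window of downtime the text refers to.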

Replication without a master

Cassandra solves this problem by simply removing the master instance from the picture. In

Cassandra, when a piece of data is written, the write is sent to all of the nodes that should

hold a copy of that data; no single node is authoritative. This neatly solves the availability

problem: with no master instance, there is no single point of failure. If a node becomes unavailable,

the data intended for it is still written to the other nodes that should store it; the

application need not halt writing data.
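
As a rough illustration of the masterless write path, the sketch below sends a write to every node that should hold a copy and skips any node that is down. It is a simplified in-memory model rather than Cassandra's actual implementation; Node and masterless_write are hypothetical names, and the set of replicas for a key is assumed to be known already.

```python
class Node:
    """Toy replica: holds its own copy of the data and an up/down flag."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


def masterless_write(replicas, key, value):
    """Send the write to every replica; return the names of nodes that stored it."""
    stored_on = []
    for node in replicas:
        if node.up:  # no node is authoritative; each live replica accepts the write
            node.data[key] = value
            stored_on.append(node.name)
    return stored_on


replicas = [Node("a"), Node("b"), Node("c")]
replicas[1].up = False  # node "b" is unavailable

stored_on = masterless_write(replicas, "user:1", "alice")
```

Here the write still lands on nodes "a" and "c" even though "b" is down, so the application does not need to stop writing.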

Note

In fact, Cassandra is even more robust when a node is unavailable to receive a write.

Through a process called hinted handoff, other nodes in the cluster will store information

about the write request, and then replay that request to the missing node when it becomes

available again.
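
The idea behind hinted handoff can be sketched as follows. This is a toy model under stated assumptions, not Cassandra's real mechanism: the hints list stands in for the information that other nodes store about a missed write, and Node, write_with_hints, and replay_hints are hypothetical names used only for illustration.

```python
class Node:
    """Toy replica: holds its own copy of the data and an up/down flag."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


def write_with_hints(replicas, hints, key, value):
    """Write to live replicas; record a hint for each replica that is down."""
    for node in replicas:
        if node.up:
            node.data[key] = value
        else:
            hints.append((node.name, key, value))


def replay_hints(replicas, hints):
    """Replay stored writes to nodes that are back; return hints still pending."""
    by_name = {node.name: node for node in replicas}
    remaining = []
    for target_name, key, value in hints:
        target = by_name[target_name]
        if target.up:
            target.data[key] = value  # the missed write is applied late
        else:
            remaining.append((target_name, key, value))
    return remaining


replicas = [Node("a"), Node("b"), Node("c")]
hints = []
replicas[1].up = False  # node "b" misses the write
write_with_hints(replicas, hints, "user:1", "alice")
missed_while_down = "user:1" not in replicas[1].data
pending_before_replay = len(hints)

replicas[1].up = True  # node "b" becomes available again
hints = replay_hints(replicas, hints)
```

After the replay, node "b" holds the same value as the other replicas and no hints remain pending.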

Returning to our model of Cassandra replication from the previous section, we can now expand

it to account for replication. In particular, each virtual node is in fact stored on mul-
