How Cassandra Distributes Data - Learning Apache Cassandra

Data replication in Cassandra

So far, we've developed a model of distribution in which the total data set is distributed among multiple machines, but any given piece of data lives on only one machine. This model carries a big advantage over a single-node configuration: it's horizontally scalable. By distributing data over multiple machines, we can accommodate ever-larger data sets simply by adding more machines to our cluster.

But our current model doesn't solve the problem of fault tolerance. No hardware is perfect; any production deployment must acknowledge that a machine might fail. Our current model isn't resilient to such failures: for instance, if Node 1 in our original three-node cluster were to suddenly catch fire, we would lose all the data on that node, including the row containing alice's user record.

To solve this problem, Cassandra provides replication; in fact, no serious Cassandra deployment would store only one copy of a given piece of data. The number of copies of each piece of data that the cluster stores is called the replication factor, and it's configured per keyspace. Recall the query that we used to create our my_status keyspace in Chapter 1, Getting Up and Running with Cassandra:

CREATE KEYSPACE "my_status"
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 1
};

For our development environment, we chose a replication factor of 1; there is little reason to store multiple copies of the data, since we're only using a single node for development. In a production deployment, however, we would choose a higher number; 3 is a good default.
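
As a minimal sketch of what the production setting might look like, assuming the my_status keyspace already exists and the cluster has at least three nodes, the replication factor can be raised with an ALTER KEYSPACE statement. This statement is an illustration and does not appear in the original text:

```cql
-- Illustrative only: raise the replication factor of the existing keyspace to 3.
-- Assumes a cluster with at least three nodes, so each copy can live on a different machine.
ALTER KEYSPACE "my_status"
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};
```

Changing the replication factor does not by itself copy existing rows to the newly responsible nodes; running nodetool repair on the nodes afterwards is the usual way to bring the new replicas up to date.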
