Data replication in Cassandra
So far, we've developed a model of distribution in which the total data set is distributed
among multiple machines, but any given piece of data lives on only one machine. This
model carries a big advantage over a single-node configuration, which is that it's
horizontally scalable. By distributing data over multiple machines, we can accommodate
ever-larger data sets simply by adding more machines to our cluster.
But our current model doesn't solve the problem of fault tolerance. No hardware is perfect;
any production deployment must acknowledge that a machine might fail. Our current model
isn't resilient to such failures: for instance, if Node 1 in our original three-node cluster
were to suddenly catch fire, we would lose all the data on that node, including the row
containing alice's user record.
To solve this problem, Cassandra provides replication; in fact, no serious Cassandra
deployment would store only one copy of a given piece of data. The number of copies of
each piece of data that the cluster stores is called the replication factor, and it's
configured at the keyspace level. Recall the keyspace definition from Running with Cassandra:
CREATE KEYSPACE "my_status"
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 1
};
For our development environment, we chose a replication factor of 1; there is little reason
to store multiple copies of the data, since we're only using a single node for development.
In a production deployment, however, we would choose a higher number; 3 is a good default.
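As a rough sketch of what that production setting might look like, the statement below creates the same keyspace with a replication factor of 3. It assumes a cluster of at least three nodes and keeps SimpleStrategy only to match the earlier example; a real deployment might well choose a different replication strategy.

```cql
-- Sketch only: assumes a production cluster with at least three nodes.
-- SimpleStrategy is kept to mirror the development keyspace shown earlier.
CREATE KEYSPACE "my_status"
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};
```

With this setting, each row in the keyspace is stored on three different nodes, so losing any single machine still leaves two copies of its data elsewhere in the cluster.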