Chapter 10. How Cassandra Distributes Data
Much of Cassandra's power lies in the fact that it is a distributed database: rather than
storing all of your data on a single machine, it is designed to distribute data across multiple
machines. A distributed architecture is hugely beneficial for scalability since you're not
bound by the hardware capacity of a single machine; if you need more storage or more
processing power, you can simply add more nodes to your Cassandra cluster. It's also a boon
for availability: by storing multiple copies of your data on multiple machines, Cassandra is
resilient to the failure of a particular node.
The beauty of a distributed database such as Cassandra is that, as application developers,
we rarely need to think about the fact that we're working with data that's spread across
multiple servers. We've spent the last nine chapters exploring a wide range of Cassandra's
functionality, and the interfaces we've worked with never require us to explicitly account for
the fact that data is distributed. From the application's perspective, we simply write data to
Cassandra and then read it back; the database takes care of figuring out which machine or
machines the data is written to or read from.
That said, when developing applications using Cassandra as a persistence layer, it's
important to understand how data is distributed and replicated. One topic of keen interest to
application developers is consistency: if multiple copies of a piece of data exist on different
machines in the cluster, how do I know that I'm reading the most up-to-date version of the
data? In distributed data stores, there is always a tradeoff between consistency and
availability; Cassandra provides tunable consistency, which allows us to decide which is more
important in any given scenario.
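To make the tradeoff concrete, here is a minimal sketch of the usual overlap rule behind tunable consistency: a read is guaranteed to see the latest acknowledged write when the number of replicas consulted on read plus the number acknowledged on write exceeds the number of copies. The function names and the plain-integer model are illustrative assumptions, not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority of the copies."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas: int,
                           write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every set of replicas read must overlap every set written.

    Hypothetical helper: it models only the counting argument, not failures,
    timeouts, or repair.
    """
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed number of copies of each piece of data
    q = quorum(rf)
    # Reading and writing at a majority favors consistency.
    print(q, read_sees_latest_write(q, q, rf))
    # Reading and writing a single replica favors availability.
    print(1, read_sees_latest_write(1, 1, rf))
```

With three copies, majority reads and writes always overlap on at least one replica, whereas single-replica reads and writes can miss each other; the latter tolerates more node failures at the cost of possibly stale reads.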
Another important consideration in a distributed data store is conflict resolution: if two
clients attempt to write different data to the same location at the same time, which write wins?
Under the hood, every piece of data written to Cassandra has a timestamp attached to it;
conflicts are resolved with a simple last-write-wins strategy. Application developers have
the ability to override the default timestamp attached to a write operation, sometimes to
interesting effect.
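The last-write-wins idea can be sketched in a few lines. This is a simplified model rather than Cassandra's actual implementation: the `Write` class, the `resolve` function, the sample values, and the tie-breaking choice (the incoming write wins on equal timestamps) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp: int  # e.g. microseconds since the epoch (assumed unit)


def resolve(existing: Write, incoming: Write) -> Write:
    """Keep whichever write carries the later timestamp.

    Arrival order is irrelevant; only the attached timestamps are compared.
    Ties go to the incoming write, which is an arbitrary choice in this sketch.
    """
    return incoming if incoming.timestamp >= existing.timestamp else existing


if __name__ == "__main__":
    first = Write("alice@example.com", timestamp=1000)
    second = Write("alice@work.example", timestamp=2000)
    print(resolve(first, second).value)      # the later timestamp wins

    # A client that overrides the timestamp with an older one loses,
    # even though its write arrives last.
    backdated = Write("stale@example.com", timestamp=500)
    print(resolve(second, backdated).value)
```

The second call shows why overriding the default timestamp can have surprising effects: a write that arrives later but carries an earlier timestamp is silently discarded in favor of the existing value.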
Finally, a fault-tolerant distributed database needs to take special care when deleting data.
In particular, a deletion does not result in Cassandra immediately forgetting that a value
ever existed; doing so might lead to the data unexpectedly reappearing in certain scenarios,