Data distribution in Cassandra
In a traditional relational database such as MySQL or PostgreSQL, the entire contents of the database reside on a single machine. At a certain scale, the hardware capacity of the server running the database becomes a constraint: simply migrating to more powerful hardware yields diminishing returns.
Let's imagine ourselves in this scenario: we have an application running on a single-machine database that has reached the limits of its capacity to scale vertically. In that case, we'll want to split the data between multiple machines, a process known as sharding or federation. Assuming we want to stick with the same underlying tool, we'll end up with multiple database instances, each of which holds a subset of our total data. Crucially, in this scenario the different database instances have no knowledge of each other; as far as each instance is concerned, it's simply a standalone database containing a standalone dataset.
It's up to our application to manage the distribution of data among our various database instances. Specifically, for any given piece of data, we'll need to know which instance it belongs on; when we're writing data, we need to write to the right instance, and when we're reading data, we need to read it back from that same place.
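As a rough sketch of that responsibility in Python (with the four instances stood in for by plain dictionaries, and hypothetical names such as instance_for, write_record, and read_record), application-managed routing might look something like this:

    # Four independent "instances", stood in for here by plain dictionaries.
    instances = [{}, {}, {}, {}]

    def instance_for(key):
        # Map a primary key to the instance that owns it; one simple
        # strategy for choosing this mapping is discussed next.
        return instances[key % len(instances)]

    def write_record(key, row):
        # The application, not the database, decides where the row lives.
        instance_for(key)[key] = row

    def read_record(key):
        # Reads must consult the same mapping that was used for the write.
        return instance_for(key).get(key)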
Assuming we are using integer primary keys (the standard practice in relational databases), one simple strategy is to use the modulus of the primary key. If we have four database instances, our primary key layout would look something like this:
Instance     Primary Keys
Instance 1   1, 5, 9, 13, …
Instance 2   2, 6, 10, 14, …
Instance 3   3, 7, 11, 15, …
Instance 4   4, 8, 12, 16, …
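Translating that table into code, a sketch of the modulus rule might look like the following; the helper name instance_number_for is hypothetical:

    NUM_INSTANCES = 4

    def instance_number_for(primary_key):
        # Keys 1, 5, 9, 13, ... map to instance 1; keys 4, 8, 12, 16, ... to instance 4.
        remainder = primary_key % NUM_INSTANCES
        return NUM_INSTANCES if remainder == 0 else remainder

    assert instance_number_for(13) == 1
    assert instance_number_for(16) == 4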
Of course, this strategy would likely result in great difficulty performing relational operations; it's quite likely that a foreign key stored on one instance would refer to a primary key stored on a different instance. For this reason, in a real application we would likely design a