How Cassandra Distributes Data - Learning Apache Cassandra

Partition keys group data on the same node

In Chapter 3, Organizing Related Data, you learned that tables with compound primary keys store all rows sharing the same partition key in contiguous physical storage. This leads to the observation that querying for ranges of clustering column values within a single partition key is highly efficient. To perform this sort of lookup, Cassandra need only locate the beginning of the range on disk, and can then read all the results beginning at that location. Conversely, querying for rows spanning multiple partition keys requires an inefficient random disk scan for each partition key being queried.
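
The contrast between the two access patterns can be sketched with a small in-memory model. This is only an illustration, not Cassandra's storage engine: the `partitions` dictionary, the sample rows, and both function names are hypothetical, and a sorted Python list stands in for the contiguous on-disk layout of a partition.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-in for one node's storage: each partition key maps to
# its rows, kept sorted by clustering column value (as if contiguous on disk).
partitions = {
    "alice": [(1, "a1"), (3, "a3"), (5, "a5"), (7, "a7")],
    "bob": [(2, "b2"), (4, "b4")],
}


def range_within_partition(partition_key, start, end):
    """Return rows of one partition whose clustering value is in [start, end]."""
    rows = partitions[partition_key]
    clustering_values = [clustering for clustering, _ in rows]
    lo = bisect_left(clustering_values, start)  # locate the start of the range
    hi = bisect_right(clustering_values, end)
    return rows[lo:hi]  # then read one contiguous slice


def rows_for_partitions(partition_keys):
    """Fetch several partitions: one independent lookup per partition key."""
    return {pk: partitions[pk] for pk in partition_keys}
```

In this model, `range_within_partition("alice", 3, 6)` touches a single partition and returns one contiguous slice, while `rows_for_partitions(["alice", "bob"])` has to locate each partition separately.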

Your new understanding of data partitioning expands this observation: you now know that querying for multiple partition keys not only requires Cassandra to make multiple disk scans, but very likely will also require retrieving data from multiple nodes and collating the results. Cassandra is entirely capable of performing this operation: the process of reading from multiple nodes and collating the results is performed by a coordinator node and is entirely transparent to the application. But it's important to remember that reading data from multiple partitions, and thus possibly from multiple nodes, is expensive and best avoided for performance-sensitive operations.
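
To see why a multi-partition read tends to fan out across the cluster, consider the following rough sketch of the planning a coordinator has to do. Everything in it is an assumption made for illustration: the node names, the `owner_node` and `plan_read` functions, and the use of an MD5 hash with a modulo to pick a node. Real Cassandra assigns partitions to nodes by token range, not by modulo.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def owner_node(partition_key):
    """Pick the node holding a partition (simplified stand-in for a partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def plan_read(partition_keys):
    """Group the requested partition keys by the node that would serve them."""
    plan = {}
    for pk in partition_keys:
        plan.setdefault(owner_node(pk), []).append(pk)
    return plan
```

A read for a single partition key always yields a plan with exactly one node. A read for many partition keys may yield a plan with several nodes, each of which the coordinator would have to contact before collating the results.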

Virtual nodes

The model of data distribution we have developed thus far is, in fact, a simplification of how a modern Cassandra cluster works. While versions of Cassandra prior to 1.2 did directly map ranges of tokens onto physical nodes, Cassandra 1.2 introduced virtual nodes, which act as an intermediary in the mapping process.
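
The difference between the two mappings can be sketched as a toy token ring. This is a simplified model under stated assumptions, not Cassandra's implementation: `build_ring` and `lookup` are hypothetical names, the token space of 0 to 2^64 is arbitrary, and tokens are drawn from a seeded random generator. With one token per node the ring resembles the direct mapping; with several tokens per node, each token behaves like a virtual node that points back to a physical node.

```python
import random
from bisect import bisect_left

TOKEN_SPACE = 2**64  # hypothetical token space, chosen only for this sketch


def build_ring(nodes, tokens_per_node, seed=0):
    """Assign each physical node `tokens_per_node` random tokens on the ring."""
    rng = random.Random(seed)
    return sorted(
        (rng.randrange(TOKEN_SPACE), node)
        for node in nodes
        for _ in range(tokens_per_node)
    )


def lookup(ring, token):
    """Return the physical node owning `token`.

    The owner is the first ring entry whose token is >= the given token,
    wrapping around to the start of the ring when none is.
    """
    tokens = [t for t, _ in ring]
    index = bisect_left(tokens, token) % len(ring)
    return ring[index][1]
```

Calling `build_ring(["n1", "n2", "n3"], 1)` gives each physical node one contiguous range, whereas `build_ring(["n1", "n2", "n3"], 4)` gives each node four smaller ranges scattered around the ring.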
