The Cassandra Architecture - Cassandra: The Definitive Guide

Database Reference

In-Depth Information

Dynamo uses a Merkle tree for anti-entropy (see the definition of Merkle tree in the Glossary ).

Cassandra uses them too, but the implementation is a little different. In Cassandra, each column

family has its own Merkle tree; the tree is created as a snapshot during a major compaction op-

eration (see “Compaction” in the Glossary ) , and is kept only as long as is required to send it to

the neighboring nodes on the ring. The advantage of this implementation is that it reduces disk

I/O.

After each update, the anti-entropy algorithm kicks in. This performs a checksum against the

database and compares checksums of peers; if the checksums differ, then the data is exchanged.

This requires using a time window to ensure that peers have had a chance to receive the most

recent update so that the system is not constantly and unnecessarily executing anti-entropy. To

keep the operation fast, nodes internally maintain an inverted index keyed by timestamp and only

exchange the most recent updates.

In Cassandra, you have multiple nodes that make up your cluster, and one or more of the nodes

act as replicas for a given piece of data. To read data, a client connects to any node in the cluster

and, based on the consistency level specified by the client, a number of nodes are read. The read

operation blocks until the client-specified consistency level is met. If it is detected that some of

the nodes responded with an out-of-date value, Cassandra will return the most recent value to

the client. After returning, Cassandra will perform what's called a read repair in the background.

This operation brings the replicas with stale values up to date.

This design is observed by Cassandra as well as by straight key/value stores such as Project Vol-

demort and Riak. It acts as a performance improvement because the client does not block until

all nodes are read, but the read repair stage manages the task of keeping the data fresh in the

background. If you have lots of clients, it's important to read from a quorum of nodes in order

to ensure that at least one will have the most recent value.

If the client specifies a weak consistency level (such as ONE ), then the read repair is performed in

the background after returning to the client. If you are using one of the two stronger consistency

levels ( QUORUM or ALL ), then the read repair happens beforedata is returned to the client.

If a read operation shows different values stored for the same timestamp, Cassandra will compare

the values directly as a tie-breaking mechanism to ensure that read repairing doesn't enter an in-

finite loop. This case should be exceedingly rare.

Memtables, SSTables, and Commit Logs

When you perform a write operation, it's immediately written to the commit log. The commit

log is a crash-recovery mechanism that supports Cassandra's durability goals. A write will not

count as successful until it's written to the commit log, to ensure that if a write operation does

Search WWH ::

Custom Search

Home