Database Reference
In-Depth Information
Dynamo uses a Merkle tree for anti-entropy (see the definition of Merkle tree in the Glossary ).
Cassandra uses them too, but the implementation is a little different. In Cassandra, each column
family has its own Merkle tree; the tree is created as a snapshot during a major compaction op-
eration (see “Compaction” in the Glossary ) , and is kept only as long as is required to send it to
the neighboring nodes on the ring. The advantage of this implementation is that it reduces disk
I/O.
After each update, the anti-entropy algorithm kicks in. This performs a checksum against the
database and compares checksums of peers; if the checksums differ, then the data is exchanged.
This requires using a time window to ensure that peers have had a chance to receive the most
recent update so that the system is not constantly and unnecessarily executing anti-entropy. To
keep the operation fast, nodes internally maintain an inverted index keyed by timestamp and only
exchange the most recent updates.
In Cassandra, you have multiple nodes that make up your cluster, and one or more of the nodes
act as replicas for a given piece of data. To read data, a client connects to any node in the cluster
and, based on the consistency level specified by the client, a number of nodes are read. The read
operation blocks until the client-specified consistency level is met. If it is detected that some of
the nodes responded with an out-of-date value, Cassandra will return the most recent value to
the client. After returning, Cassandra will perform what's called a read repair in the background.
This operation brings the replicas with stale values up to date.
This design is observed by Cassandra as well as by straight key/value stores such as Project Vol-
demort and Riak. It acts as a performance improvement because the client does not block until
all nodes are read, but the read repair stage manages the task of keeping the data fresh in the
background. If you have lots of clients, it's important to read from a quorum of nodes in order
to ensure that at least one will have the most recent value.
If the client specifies a weak consistency level (such as ONE ), then the read repair is performed in
the background after returning to the client. If you are using one of the two stronger consistency
levels ( QUORUM or ALL ), then the read repair happens beforedata is returned to the client.
If a read operation shows different values stored for the same timestamp, Cassandra will compare
the values directly as a tie-breaking mechanism to ensure that read repairing doesn't enter an in-
finite loop. This case should be exceedingly rare.
Memtables, SSTables, and Commit Logs
When you perform a write operation, it's immediately written to the commit log. The commit
log is a crash-recovery mechanism that supports Cassandra's durability goals. A write will not
count as successful until it's written to the commit log, to ensure that if a write operation does
Search WWH ::




Custom Search