Surveying Database Internals - Professional NoSQL

Databases Reference

In-Depth Information

idea similar to the concept of human gossip. In the case of gossip a peer arbitrarily chooses to send

messages to other nodes. In Cassandra, gossip is more systematic and is triggered by a Gossiper

class that runs on the basis of a timer. Nodes register themselves with the Gossiper class and

receive updates as gossip propagates through the network. Gossip is meant for large distributed

systems and is not particularly reliable. In Cassandra, the Gossiper class keeps track of nodes as

gossip spreads through them.

In terms of the workfl ow, every timer-driven Gossiper action requires the Gossiper to choose a

random node and send that node a message. This message is named GossipDigestSyncMessage.

The receiving node, if active, sends an acknowledgment back to the Gossiper. To complete gossip,

the Gossiper sends an acknowledgment in response to the acknowledgment it receives. If the

communication completes all steps, gossip successfully shares the state information between the

Gossiper and the node. If during gossip the communication fails, it indicates that possibly the node

may be down.

To detect failure, Cassandra uses an algorithm called the Phi Accrual Failure Detection. This

method of detection converts the binary spectrum of node alive or node dead to a level in the middle

that indicates the suspicion level. The traditional idea of failure detection via periodic heartbeats is

therefore replaced with a continuous assessment of suspicion levels.

Whereas gossip keeps the nodes in sync and repairs any temporary damages, more severe damages

are identifi ed and repaired via an anti-entropy mechanism. In this process, data in a column-family

is converted to a hash using the Merkle tree. The Merkle tree representations compare data between

neighboring nodes. If there is a discrepancy, the nodes are reconciled and repaired. The Merkle tree

is created as a snapshot during a major compaction operation.

This reconfi rms that the weak consistency in Cassandra may require reading from a Quorum to

avoid inconsistencies.

Fast Writes

Writes in Cassandra are extremely fast because they are simply appended to commit logs on any

available node and no locks are imposed in the critical path. A write operation involves a write into

a commit log for durability and recoverability and an update into an in-memory data structure. The

write into the in-memory data structure is performed only after a successful write into the commit

log. Typically, there is a dedicated disk on each machine for the commit log because all writes into

the commit log are sequential and so we can maximize disk throughput. When the in-memory data

structure crosses a certain threshold, calculated based on data size and number of objects, it dumps

itself to disk.

All writes are sequential to disk and also generate an index for effi cient lookup based on a row-

key. These indexes are also persisted along with the data. Over time, many such logs could exist

on disk and a merge process could run in the background to collate the different logs into one log.

This process of compaction merges data in SSTables, the underlying storage format. It also leads to

rationalization of keys and combination of columns, deletions of data items marked for deletion,

and creation of new indexes.

Search WWH ::

Custom Search

Home