Databases Reference
In-Depth Information
idea similar to the concept of human gossip. In the case of gossip a peer arbitrarily chooses to send
messages to other nodes. In Cassandra, gossip is more systematic and is triggered by a Gossiper
class that runs on the basis of a timer. Nodes register themselves with the Gossiper class and
receive updates as gossip propagates through the network. Gossip is meant for large distributed
systems and is not particularly reliable. In Cassandra, the Gossiper class keeps track of nodes as
gossip spreads through them.
In terms of the workfl ow, every timer-driven Gossiper action requires the Gossiper to choose a
random node and send that node a message. This message is named GossipDigestSyncMessage.
The receiving node, if active, sends an acknowledgment back to the Gossiper. To complete gossip,
the Gossiper sends an acknowledgment in response to the acknowledgment it receives. If the
communication completes all steps, gossip successfully shares the state information between the
Gossiper and the node. If during gossip the communication fails, it indicates that possibly the node
may be down.
To detect failure, Cassandra uses an algorithm called the Phi Accrual Failure Detection. This
method of detection converts the binary spectrum of node alive or node dead to a level in the middle
that indicates the suspicion level. The traditional idea of failure detection via periodic heartbeats is
therefore replaced with a continuous assessment of suspicion levels.
Whereas gossip keeps the nodes in sync and repairs any temporary damages, more severe damages
are identifi ed and repaired via an anti-entropy mechanism. In this process, data in a column-family
is converted to a hash using the Merkle tree. The Merkle tree representations compare data between
neighboring nodes. If there is a discrepancy, the nodes are reconciled and repaired. The Merkle tree
is created as a snapshot during a major compaction operation.
This reconfi rms that the weak consistency in Cassandra may require reading from a Quorum to
avoid inconsistencies.
Fast Writes
Writes in Cassandra are extremely fast because they are simply appended to commit logs on any
available node and no locks are imposed in the critical path. A write operation involves a write into
a commit log for durability and recoverability and an update into an in-memory data structure. The
write into the in-memory data structure is performed only after a successful write into the commit
log. Typically, there is a dedicated disk on each machine for the commit log because all writes into
the commit log are sequential and so we can maximize disk throughput. When the in-memory data
structure crosses a certain threshold, calculated based on data size and number of objects, it dumps
itself to disk.
All writes are sequential to disk and also generate an index for effi cient lookup based on a row-
key. These indexes are also persisted along with the data. Over time, many such logs could exist
on disk and a merge process could run in the background to collate the different logs into one log.
This process of compaction merges data in SSTables, the underlying storage format. It also leads to
rationalization of keys and combination of columns, deletions of data items marked for deletion,
and creation of new indexes.
Search WWH ::




Custom Search