Storing Streaming Data - Real-Time Analytics

Database Reference

In-Depth Information

and Voldemort, and many of the concepts introduced in this section are

applicable to those other technologies. The internal data model used by

Cassandra is also covered, primarily for historical reasons, because CQL

abstracts much of these internals with its own somewhat incompatible data

model.

Server Architecture

The Cassandra server architecture is a “master-less” cluster of nodes, the

goal being to eliminate single points of failure and allow linear scaling.

To achieve this, Cassandra uses a distributed hash table approach to data

management. In this model, each row of data is assigned a partition key.

This key defines which server stores a particular piece of information. To

improve data durability, Cassandra can also replicate this data to other

nodes through consistent hashing.

Early versions (prior to 1.2) used a server-based consistent hashing

technique that required a fair bit of maintenance to calculate and assign

tokens to each node to tell it which parts of the hash range each node would

store. Newer versions of Cassandra introduced the concept of the virtual

node, called a “vnode,” which breaks the hash range into much smaller

pieces that are then randomly distributed among the nodes.

Each of these virtual nodes maintains its own indexes for the data contained

within that partition as well as a specialized data structure called a Bloom

Filter (discussed in detail in Chapter 10, “Approximating Streaming Data

with Sketching”) that helps to quickly determine if a query needs any data

from a particular virtual node.

A node in a cluster maintains a map of other servers through a peer-to-peer

communication protocol. Called the “gossip” protocol, each server

exchanges state information with up to three other servers in the cluster

roughly once every second. This information also contains state information

for other servers, so a given node will quickly learn the entire cluster

topology as its gossips with its peers.

In addition to the gossip protocol, the “snitch” protocol determines the

local network topology. This helps Cassandra optimize its request routing

and discover nodes that are in an active, but degraded, state. In practice,

“sick” nodes arise for a variety of reasons and are fairly common as a result.

There are different kinds of snitches optimized to various deployments. For

Search WWH ::

Custom Search

Home