Surveying Database Internals - Professional NoSQL

Databases Reference

In-Depth Information

Many concepts in Hypertable and HBase are the same and so repeating those same concepts here

is not benefi cial. However, a passing reference to an idea that is important in both places but not

discussed yet is that of a Bloom Filter.

Bloom Filter

Bloom Filter is a probabilistic data structure used to test whether an element is a member of a set.

Think of a Bloom Filter as an array of m number of bits. An empty Bloom Filter has a value of 0 in

all its m positions. Now if elements a , b , and c are members of a set, then they are mapped via a set

of k hash functions to the Bloom Filter. This means each of the members, that is a , b , and c are

mapped via the k different hash functions to k positions on the Bloom Filter. Whenever a member

is mapped via a hash function to a specifi c position in the m bit array the value in that particular

position is set to 1. Different members, when passed through the hash functions, could map to the

same position in the Bloom Filter.

Now to test if a given element, say w , is a member of a set, you need to pass it through the k

hash functions and map the outcome on the Bloom Filter array. If the value at any of the mapped

positions is 0, the element is not a member of the set. If the value at all the positions is 1, then either

the element is a member of the set or the element maps to one or more position where the value was

set to 1 by another element. Therefore, false positives are possible but false negatives are not.

Learn more about Bloom Filter, explained in the context of PERL, at www.perl.com/pub/2004/

04/08/bloom_filters.html .

APACHE CASSANDRA

Apache Cassandra is simultaneously a very popular and infamous NoSQL database. A few

examples that used Cassandra in the early part of the topic introduced the core ideas of the store.

In this section, I review Cassandra's core architecture to understand how it works.

Peer-to-Peer Model

Most databases, including the most popular of NoSQL stores, follow a master-slave model for

scaling out and replication. This means for each set, writes are committed to the master node and

replicated down to the slaves. The slaves provide enhanced read scalability but not write scalability.

Cassandra moves away from the master-slave model and instead uses a peer-to-peer model. This

means there is no single master but all the nodes are potentially masters. This makes the writes and

reads extremely scalable and even allows nodes to function in cases of partition tolerance. However,

extreme scalability comes at a cost, which in this case is a compromise in strong consistency. The

peer-to-peer model follows a weak consistency model.

Based on Gossip and Anti-entropy

Cassandra's peer-to-peer scalability and eventual consistency model makes it important to

establish a protocol to communicate among the peers and detect node failure. Cassandra relies on

a gossip-based protocol to communicate among the nodes. Gossip, as the name suggests, uses an

Search WWH ::

Custom Search

Home