Cassandra Architecture - Mastering Apache Cassandra

Database Reference

In-Depth Information

all values are 1s, this means that the value may exist in the collection associated with this

bloom filter. We cannot guarantee its presence in the collection because it is possible that

there exist other k keys whose i th hash function filled the same spot in the array as the j th

hash of the key that we are looking for.

Removal of a key from a bloom filter as in its original avatar is not possible. One may

break multiple keys because multiple keys may have the same index bit set to 1 in the ar-

ray for different hashes. Counting bloom filter solves these issues by changing the bit ar-

ray into an integer array where each element works as a counter; insertion increments the

counter and deletion decrements it.

Effectiveness of the bloom filter depends on the size of the collection it is applied to. The

bigger the collection associated with the bloom filter, the higher the frequency of false

positives will be (because the array will be more densely packed with 1s). Another thing

that governs bloom filter is the quality of a good hash function. A good hash function will

distribute hash values evenly in the array, and it will be fast. One does not look at the

cryptic strength of the hash function here, so the Murmur3 hash will be preferred over the

SHA1 hash.

From Cassandra 1.2 onwards, bloom filters are stored off heap memory. This is done to al-

leviate pressure on heap memory because Java garbage collectors start to underperform

for heap size 8 GB or more, and that affects Cassandra's performance.

Index files

Index files are companion files of SSTables. Similar to the bloom filter, there exists one

index file per SSTable. It contains all the row keys in the SSTable and its offset is at the

point where the row starts in the data file.

At startup, Cassandra reads every 128th key (configurable) into the memory (sampled in-

dex). When the index is looking for a row key (after the bloom filter hinted that the row

key might be in this SSTable), Cassandra performs a binary search on the sampled index

in memory. Followed by a positive result from the binary search, Cassandra will have to

read a block in the index file from the disk starting from the nearest value lower than the

value that we are looking for.

Let's take an example. See the figure in the Read repair and anti-entropy section, where

Cassandra is looking for a row key 404. It is not in MemTable. On querying the bloom fil-

ter of a certain SSTable, Cassandra gets a positive nod that this SSTable may contain the

row. The next step is to look into the SSTable. But before we start scanning the SSTable

Search WWH ::

Custom Search

Home