Database Reference
In-Depth Information
all values are 1s, this means that the value may exist in the collection associated with this
bloom filter. We cannot guarantee its presence in the collection because it is possible that
there exist other k keys whose i th hash function filled the same spot in the array as the j th
hash of the key that we are looking for.
Removal of a key from a bloom filter as in its original avatar is not possible. One may
break multiple keys because multiple keys may have the same index bit set to 1 in the ar-
ray for different hashes. Counting bloom filter solves these issues by changing the bit ar-
ray into an integer array where each element works as a counter; insertion increments the
counter and deletion decrements it.
Effectiveness of the bloom filter depends on the size of the collection it is applied to. The
bigger the collection associated with the bloom filter, the higher the frequency of false
positives will be (because the array will be more densely packed with 1s). Another thing
that governs bloom filter is the quality of a good hash function. A good hash function will
distribute hash values evenly in the array, and it will be fast. One does not look at the
cryptic strength of the hash function here, so the Murmur3 hash will be preferred over the
SHA1 hash.
From Cassandra 1.2 onwards, bloom filters are stored off heap memory. This is done to al-
leviate pressure on heap memory because Java garbage collectors start to underperform
for heap size 8 GB or more, and that affects Cassandra's performance.
Index files
Index files are companion files of SSTables. Similar to the bloom filter, there exists one
index file per SSTable. It contains all the row keys in the SSTable and its offset is at the
point where the row starts in the data file.
At startup, Cassandra reads every 128th key (configurable) into the memory (sampled in-
dex). When the index is looking for a row key (after the bloom filter hinted that the row
key might be in this SSTable), Cassandra performs a binary search on the sampled index
in memory. Followed by a positive result from the binary search, Cassandra will have to
read a block in the index file from the disk starting from the nearest value lower than the
value that we are looking for.
Let's take an example. See the figure in the Read repair and anti-entropy section, where
Cassandra is looking for a row key 404. It is not in MemTable. On querying the bloom fil-
ter of a certain SSTable, Cassandra gets a positive nod that this SSTable may contain the
row. The next step is to look into the SSTable. But before we start scanning the SSTable
Search WWH ::




Custom Search