Database Reference
In-Depth Information
Bloom Filters
A bloom filter is a space-efficient probabilistic data structure used to determine whether an element is a member of a set. False positives are possible; false negatives are not. A false positive means that the bloom filter thinks the data is on the node when it actually is not. A false negative would mean that the bloom filter thinks the data is not on the node when it actually is.
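To make that behavior concrete, here is a minimal bloom filter sketch in Java (an illustration only, not Cassandra's implementation): adding a value sets a handful of bits derived from its hash, and a lookup reports "possibly present" only if all of those bits are set, so it can return false positives but never false negatives.

```java
import java.util.BitSet;

// Minimal bloom filter sketch: false positives possible, false negatives not.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive several bit positions from one value via double hashing.
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1; // force the second hash to be odd
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String value) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(value, i));
        }
    }

    // true means "possibly present"; false means "definitely absent".
    public boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("row-key-1");
        System.out.println(filter.mightContain("row-key-1")); // always true
        System.out.println(filter.mightContain("row-key-2")); // usually false; a false positive is possible
    }
}
```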
Cassandra uses bloom filters to determine whether an SSTable has data for a particular row. They are used for index scans, but not for range scans. On a per-ColumnFamily basis, the higher the bloom_filter_fp_chance setting, the less memory will be used. However, a higher false-positive chance results in greater disk I/O, because more SSTables that do not actually contain the requested row end up being read. It is important to note that starting in Cassandra version 1.2, bloom filters are stored off-heap. This means that they do not need to be taken into consideration when determining the maximum heap size for the JVM.
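As a rough illustration, the setting can be changed per table (ColumnFamily) with a CQL ALTER TABLE statement. The sketch below assumes a locally running node and a hypothetical my_keyspace.my_table, and uses the DataStax Java driver to issue the statement.

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class TuneBloomFilter {
    public static void main(String[] args) {
        // Connects to a Cassandra node on localhost with default settings.
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical table: raise the false-positive chance to spend less
            // memory on bloom filters, at the cost of extra SSTable reads.
            session.execute(
                "ALTER TABLE my_keyspace.my_table "
                    + "WITH bloom_filter_fp_chance = 0.1");
        }
    }
}
```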
Compaction Types
Initially, all data written to Cassandra hits the disk via the CommitLog while also being held in an in-memory memtable. When a memtable is flushed, its contents are written to disk as SSTables, and over time those SSTables are merged together, or compacted. There are two common strategies for this compaction: the default, size-tiered, and the less commonly used leveled.
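For illustration, the compaction strategy is chosen per table, for example when the table is created. The sketch below assumes the DataStax Java driver, a local node, and a hypothetical my_keyspace.events table; swapping in 'LeveledCompactionStrategy' would select leveled compaction instead.

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class ChooseCompactionStrategy {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical table using the default size-tiered strategy.
            session.execute(
                "CREATE TABLE IF NOT EXISTS my_keyspace.events ("
                    + "  id uuid PRIMARY KEY,"
                    + "  payload text"
                    + ") WITH compaction = {'class': 'SizeTieredCompactionStrategy'}");
        }
    }
}
```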
SizeTieredCompaction
The default type of compaction in Cassandra, SizeTieredCompaction, is made for insert-heavy workloads that are lighter on reads. The key issue with SizeTieredCompaction is that it needs a large amount of free disk space: in the worst case, a compaction can temporarily take up to twice the size of the data being compacted on disk. In other words, if you have 400GB of data in your SSTables on a 500GB drive, you will likely not be able to complete a compaction. The combined size of the SSTables being compacted determines how much free disk space is required.
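To make the disk-space arithmetic concrete, here is a back-of-the-envelope sketch (not a Cassandra API): it treats the combined size of the SSTables in a compaction as the free space that compaction may temporarily need, which is why 400GB of SSTables on a 500GB drive leaves too little headroom.

```java
// Back-of-the-envelope headroom check for a size-tiered compaction (sketch only).
public class CompactionHeadroom {
    // Worst case: the compaction rewrites every SSTable it merges, so it may
    // need free space roughly equal to their combined size.
    static boolean hasHeadroom(long[] sstableSizesBytes, long freeDiskBytes) {
        long required = 0;
        for (long size : sstableSizesBytes) {
            required += size;
        }
        return freeDiskBytes >= required;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        // 400GB of SSTables on a 500GB drive leaves roughly 100GB free.
        System.out.println(hasHeadroom(new long[] {200 * gb, 200 * gb}, 100 * gb)); // false
    }
}
```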