not make it to the in-memory store (the memtable, discussed in a moment), it will still be
possible to recover the data.
After it's written to the commit log, the value is written to a memory-resident data structure
called the memtable. When the number of objects stored in the memtable reaches a threshold,
the contents of the memtable are flushed to disk in a file called an SSTable. A new memtable is
then created. This flushing is a nonblocking operation; multiple memtables may exist for a single
column family, one current and the rest waiting to be flushed. They typically should not have to
wait very long, as the node should flush them very quickly unless it is overloaded.
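The write path described above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the `Node` class, its `flush_threshold` parameter, and the use of in-memory lists in place of on-disk files are all assumptions made for illustration, and the flush here is synchronous, whereas the real flush is nonblocking.

```python
class Node:
    """Toy write path: commit log first, then memtable, then flush to an SSTable."""

    def __init__(self, flush_threshold=3):
        # Hypothetical knob: number of entries a memtable holds before flushing.
        self.flush_threshold = flush_threshold
        self.commit_log = []  # append-only list standing in for the on-disk log
        self.memtable = {}    # current in-memory table
        self.sstables = []    # each flushed SSTable is an immutable sorted tuple

    def write(self, key, value):
        # Durability first: the write is appended to the commit log.
        self.commit_log.append((key, value))
        # Then it goes into the memory-resident memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Swap in a fresh memtable, then persist the full one sorted by key.
        full, self.memtable = self.memtable, {}
        self.sstables.append(tuple(sorted(full.items())))


node = Node(flush_threshold=2)
node.write("b", 2)
node.write("a", 1)   # threshold reached: memtable flushed as a sorted SSTable
node.write("c", 3)   # lands in the new, empty memtable
```

After these three writes, the node holds one SSTable with keys in sorted order and a new memtable containing only the last write.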
Each commit log maintains an internal bit flag to indicate whether it needs flushing. When a
write operation is first received, it is written to the commit log and its bit flag is set to 1. There is
only one bit flag per column family, because only one commit log is ever being written to across
the entire server. All writes to all column families will go into the same commit log, so the bit
flag indicates whether a particular commit log contains anything that hasn't been flushed for a
particular column family. Once the memtable has been properly flushed to disk, the corresponding
commit log's bit flag is set to 0, indicating that the commit log no longer has to maintain
that data for durability purposes. Like regular logfiles, commit logs have a configurable rollover
threshold, and once this file size threshold is reached, the log will roll over, carrying with it any
extant dirty bit flags.
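The dirty-flag bookkeeping can be illustrated with a small sketch. The class and method names below (`CommitLogSegment`, `mark_flushed`, `is_discardable`) are hypothetical, and the idea that a log with no dirty flags can be discarded is an inference from the text rather than a statement of how Cassandra manages its files.

```python
class CommitLogSegment:
    """Toy commit log shared by all column families, with one dirty flag each."""

    def __init__(self, column_families):
        self.entries = []
        # One bit flag per column family; 0 means nothing unflushed.
        self.dirty = {cf: 0 for cf in column_families}

    def append(self, cf, key, value):
        # Every write, for any column family, goes into the same log.
        self.entries.append((cf, key, value))
        self.dirty[cf] = 1

    def mark_flushed(self, cf):
        # Called once the column family's memtable is safely on disk.
        self.dirty[cf] = 0

    def is_discardable(self):
        # Assumption: the log is no longer needed once no flag is set.
        return not any(self.dirty.values())


seg = CommitLogSegment(["users", "orders"])
seg.append("users", "k1", "v1")
flags_after_write = dict(seg.dirty)
needed_before_flush = not seg.is_discardable()
seg.mark_flushed("users")
```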
The SSTable is a concept borrowed from Google's Bigtable. Once a memtable is flushed to disk
as an SSTable, it is immutable and cannot be changed by the application. Despite the fact that
SSTables are compacted, this compaction changes only their on-disk representation; it essentially
performs the “merge” step of a mergesort into new files and removes the old files on success.
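Because each SSTable is already sorted, compaction only needs the merge step of a mergesort. The sketch below assumes SSTables are passed oldest first and that the newest file's value wins for a duplicate key; Cassandra itself reconciles values by write timestamp, so the `age` tag here is a simplification, and `compact` is a hypothetical name.

```python
import heapq


def compact(sstables):
    """Merge sorted SSTables (oldest first) into one new sorted SSTable."""
    # Tag each entry with its table's age so equal keys order oldest to newest.
    tagged = (
        [(key, age, value) for key, value in table]
        for age, table in enumerate(sstables)
    )
    merged = {}
    # heapq.merge streams the already-sorted inputs in overall sorted order.
    for key, age, value in heapq.merge(*tagged):
        merged[key] = value  # a newer entry for the same key overwrites
    # The inputs are never modified; the result is a brand-new table.
    return tuple(merged.items())


old = (("a", 1), ("c", 3))
new = (("a", 10), ("b", 2))
compacted = compact([old, new])
```

The input tuples are left untouched, which mirrors the point that compaction writes new files and removes the old ones only on success.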
NOTE
The idea that “SSTable” is a contraction of “Sorted String Table” is somewhat of a misnomer for
Cassandra, because the data is not stored as strings on disk.
Each SSTable also has an associated Bloom filter, which is used as an additional performance
enhancer (see Bloom Filters).
All writes are sequential, which is the primary reason that writes perform so well in Cassandra.
No reads or seeks of any kind are required for writing a value to Cassandra because all writes are
append operations. As a result, the speed of your disk becomes one key limit on write performance.
Compaction is intended to amortize the reorganization of data, but it uses sequential IO to do so. So
the performance benefit is gained by splitting the work: the write operation is just an immediate append,
and compaction later reorganizes the data for better read performance. If Cassandra naively
inserted values where they ultimately belonged, writing clients would pay for seeks up front.