Hard disk capacity
A rough calculation of the disk space needed for the user data that will be stored in Cassandra involves adding up the data stored in four components on disk: commit logs, SSTables, index files, and bloom filters. When comparing incoming raw data with the data on disk, you need to account for the database overhead associated with each type of data; the data on disk can be about twice as large as the raw data. Disk usage can be estimated using the following snippet:
# Size of one normal column
column_size (in bytes) = column_name_size + column_val_size + 15
# Size of an expiring or counter column
col_size (in bytes) = column_name_size + column_val_size + 23
# Size of a row
row_size (in bytes) = size_of_all_columns + row_key_size + 23
# Primary index file size
index_size (in bytes) = number_of_rows * (32 + mean_key_size)
# Additional space consumption due to replication
replication_overhead = total_data_size * (replication_factor - 1)
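As a minimal sketch of how these formulas combine, the following Python helper estimates on-disk size for a hypothetical table; the row count, key sizes, and replication factor in the example are illustrative assumptions, not recommendations.

# Rough Cassandra disk-usage estimate based on the formulas above.
# All sizes are in bytes; the example inputs are illustrative assumptions.

REGULAR_COLUMN_OVERHEAD = 15    # overhead per normal column
COUNTER_COLUMN_OVERHEAD = 23    # overhead per expiring or counter column
ROW_OVERHEAD = 23               # overhead per row
INDEX_ENTRY_OVERHEAD = 32       # overhead per primary index entry

def column_size(name_size, value_size, expiring_or_counter=False):
    overhead = COUNTER_COLUMN_OVERHEAD if expiring_or_counter else REGULAR_COLUMN_OVERHEAD
    return name_size + value_size + overhead

def row_size(column_sizes, row_key_size):
    return sum(column_sizes) + row_key_size + ROW_OVERHEAD

def index_size(number_of_rows, mean_key_size):
    return number_of_rows * (INDEX_ENTRY_OVERHEAD + mean_key_size)

def replication_overhead(total_data_size, replication_factor):
    return total_data_size * (replication_factor - 1)

# Hypothetical table: 1 million rows, 10 normal columns per row.
rows, columns_per_row = 1_000_000, 10
one_column = column_size(name_size=10, value_size=100)
one_row = row_size([one_column] * columns_per_row, row_key_size=16)
data_plus_index = rows * one_row + index_size(rows, mean_key_size=16)
print("data + index (bytes):", data_plus_index)
print("extra space for replication at RF=3:", replication_overhead(data_plus_index, 3))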
Apart from this, the disk also sees heavy read and write activity during compaction. Compaction is the process that merges SSTables to improve read efficiency. The important thing about compaction is that, in the worst case, it may temporarily use as much additional space as the user data itself occupies, so it is a good idea to leave plenty of free disk space. We'll discuss this again later, but the headroom needed depends on the compaction_strategy that is applied: for LeveledCompactionStrategy, about 10 percent free space is enough, whereas SizeTieredCompactionStrategy requires up to 50 percent free disk space in the worst case.
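As a rough way to turn those percentages into capacity planning, here is a small sketch; the class names are Cassandra's actual compaction strategies, but the 10 and 50 percent figures are only the rules of thumb quoted above, not exact guarantees.

# Estimate the disk capacity to provision so that the stored data still
# leaves the recommended fraction of the disk free for compaction.

FREE_SPACE_RULE_OF_THUMB = {
    "LeveledCompactionStrategy": 0.10,     # keep about 10% of the disk free
    "SizeTieredCompactionStrategy": 0.50,  # keep up to 50% free (worst case)
}

def required_capacity(data_size_bytes, strategy):
    free_fraction = FREE_SPACE_RULE_OF_THUMB[strategy]
    return data_size_bytes / (1.0 - free_fraction)

# Example: provisioning for 1 TB of on-disk data under each strategy.
for strategy in FREE_SPACE_RULE_OF_THUMB:
    print(strategy, round(required_capacity(10**12, strategy) / 10**12, 2), "TB")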
Here are some rules of thumb with regard to disk choice and disk operations:
Commit logs and data files on separate disks: Commit logs are updated on each write and are read only at startup, which is rare. The data directory, on the other hand, is written to whenever MemTables are flushed into SSTables asynchronously, it is read and rewritten during compaction, and, most importantly, it may be looked up by client reads (see the configuration sketch below).
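As a configuration sketch of that point, the relevant settings live in cassandra.yaml; the directory paths below are placeholders, assuming each path is mounted on its own physical disk.

# cassandra.yaml (excerpt): keep the commit log and the data files on
# separate physical disks; the paths shown are placeholder mount points.
commitlog_directory: /mnt/disk1/cassandra/commitlog
data_file_directories:
    - /mnt/disk2/cassandra/data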