Dynamo and Cassandra are two popular AP systems. Cassandra, with its good scalability, is used by mainstream commercial online SNS companies such as Facebook and Twitter to store massive textual data. Specifically, Cassandra uses consistent hashing to map both server identifiers and the key space of user data randomly and evenly onto the same value domain, and each server manages the user data whose keys fall in the segment of that domain adjacent to its own mapped position. In this way, a dynamic change at any server in the system only affects the data in the small segment of the value domain that the server itself covers. Mainstream SNSs adopt this distributed key-value storage approach to meet the scalability demands of large-scale online SNS systems, to balance load across servers, and to adapt to dynamic changes in the system.
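As a rough illustration of the scheme just described, the following Python sketch maps servers and keys onto a shared hash ring. The names (ConsistentHashRing, add_server, lookup), the use of MD5, and the three virtual points per server are illustrative assumptions, not Cassandra's actual implementation.

import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: servers and keys share one value domain.

    All names here are hypothetical; this is not Cassandra's API.
    """

    def __init__(self, replicas=3):
        self.replicas = replicas      # virtual points per server, smooths load
        self.ring = []                # sorted list of (hash value, server)

    @staticmethod
    def _hash(value):
        # Map any string into a fixed integer value domain.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_server(self, server):
        for i in range(self.replicas):
            point = self._hash(f"{server}#{i}")
            bisect.insort(self.ring, (point, server))

    def remove_server(self, server):
        # Only keys in the segments this server owned are remapped.
        self.ring = [(p, s) for p, s in self.ring if s != server]

    def lookup(self, key):
        # A key is owned by the first server point at or after its hash.
        point = self._hash(key)
        idx = bisect.bisect(self.ring, (point, ""))
        if idx == len(self.ring):
            idx = 0                   # wrap around the ring
        return self.ring[idx][1]

ring = ConsistentHashRing()
for s in ("node-a", "node-b", "node-c"):
    ring.add_server(s)
owner = ring.lookup("user:42")        # some node, e.g. 'node-b'
ring.remove_server(owner)             # only that node's segments move
print(ring.lookup("user:42"))         # key now served by a neighbor

Removing a server relocates only the keys in its own segments, which is exactly the property that lets the system absorb dynamic membership changes cheaply.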
In order to support the storage of users' textual data, Cassandra inherits the column family model of BigTable, which aggregates data with similar features into a column family. Unlike BigTable, Cassandra extends the concept of the column family to the super column family: a column family of column families. On each Cassandra node, every column family corresponds to a MemTable resident in memory. When a node writes data, it first writes into the MemTable. On appropriate occasions, e.g., when the memory occupied by the MemTable reaches an upper bound or a fixed amount of time has elapsed, the MemTable is flushed to a corresponding SSTable on disk. SSTables achieve high write throughput because they are written sequentially. The system builds a local index for every block of data written to disk, and Cassandra keeps a compressed form of this index in memory as a Bloom filter [4]. Because the compressed index omits the block's position within the file system, Cassandra does not perform well for random reads.
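The write path described above can be modeled in a few lines. This is a simplified sketch under assumed names (BloomFilter, ColumnFamilyStore, and a flush threshold counted in entries rather than bytes), not Cassandra's internals: writes are absorbed by an in-memory MemTable, flushed as a sorted immutable SSTable, and each SSTable carries a Bloom filter as its compressed membership index.

import hashlib

class BloomFilter:
    """Compact index that answers 'definitely absent' or 'maybe present'."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

class ColumnFamilyStore:
    """One MemTable per column family; flushes to immutable sorted SSTables."""

    MEMTABLE_LIMIT = 4                # flush threshold (entries, for the demo)

    def __init__(self):
        self.memtable = {}            # mutable, in memory, absorbs all writes
        self.sstables = []            # list of (bloom filter, sorted rows);
                                      # on disk in a real system

    def write(self, key, value):
        self.memtable[key] = value    # writes always hit memory first
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        # Rows are sorted and appended sequentially: the source of the
        # SSTable's high write throughput.
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest SSTable first; Bloom filters skip tables lacking the key.
        for bloom, rows in reversed(self.sstables):
            if bloom.might_contain(key):
                for k, v in rows:
                    if k == key:
                        return v
        return None

store = ColumnFamilyStore()
for i in range(6):
    store.write(f"user:{i}", f"profile-{i}")
print(store.read("user:1"))           # served from a flushed SSTable
print(store.read("user:5"))           # still in the MemTable

Note how a read uses the Bloom filters only to skip SSTables that certainly lack the key; since the filter says nothing about where a present key lives, each candidate table must still be searched, mirroring the random-read weakness noted above.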
4.3 Storage Mechanism for Big Data
Considerable research on big data has promoted the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (a) file systems, (b) databases, and (c) programming models.
File systems are the foundation of the applications at the upper levels. Google's GFS is a scalable distributed file system built to support large-scale, distributed, data-intensive applications [5]. GFS uses cheap commodity servers to achieve fault tolerance and to provide customers with high-performance services. GFS suits large-file applications in which reads are far more frequent than writes. However, GFS also has limitations, such as a single point of failure and poor performance for small files. These limitations have been overcome by Colossus [6], the successor of GFS.
In addition, other companies and researchers have developed their own solutions to meet the different demands of big data storage. For example, HDFS and Kosmosfs are open-source derivatives inspired by GFS. Microsoft developed Cosmos [7] to support its search and advertisement business. Facebook utilizes Haystack [8] to store its large volume of small photo files.