Dynamo and Cassandra are two popular AP systems. Cassandra, with its good scalability, is used by mainstream commercial online SNS companies such as Facebook and Twitter to store massive textual data. Specifically, Cassandra uses consistent hashing to map both server identifiers and the key space of user data randomly and evenly onto the same value domain, and each server manages the user data whose keys fall in the segment of that domain adjacent to its own mapped position. In this way, a dynamic change at any server in the system only affects the data in the small segment of the value domain that the server itself covers. Mainstream SNSs adopt this distributed key-value storage approach to meet the scalability demands of large-scale online SNS systems, to balance load across servers, and to adapt to dynamic changes in the system.
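As a rough illustration of the scheme just described, the following Python sketch maps servers and keys onto a shared hash ring. The names (ConsistentHashRing, add_server, lookup), the use of MD5, and the three virtual points per server are illustrative assumptions, not Cassandra's actual implementation.

import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: servers and keys share one value domain.

    All names here are hypothetical; this is not Cassandra's API.
    """

    def __init__(self, replicas=3):
        self.replicas = replicas      # virtual points per server, smooths load
        self.ring = []                # sorted list of (hash value, server)

    @staticmethod
    def _hash(value):
        # Map any string into a fixed integer value domain.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_server(self, server):
        for i in range(self.replicas):
            point = self._hash(f"{server}#{i}")
            bisect.insort(self.ring, (point, server))

    def remove_server(self, server):
        # Only keys in the segments this server owned are remapped.
        self.ring = [(p, s) for p, s in self.ring if s != server]

    def lookup(self, key):
        # A key is owned by the first server point at or after its hash.
        point = self._hash(key)
        idx = bisect.bisect(self.ring, (point, ""))
        if idx == len(self.ring):
            idx = 0                   # wrap around the ring
        return self.ring[idx][1]

ring = ConsistentHashRing()
for s in ("node-a", "node-b", "node-c"):
    ring.add_server(s)
owner = ring.lookup("user:42")        # some node, e.g. 'node-b'
ring.remove_server(owner)             # only that node's segments move
print(ring.lookup("user:42"))         # key now served by a neighbor

Removing a server relocates only the keys in its own segments, which is exactly the property that lets the system absorb dynamic membership changes cheaply.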
In order to support the storage of users' textual data, Cassandra inherits the column family model of BigTable, which aggregates data with similar features into a column family. Unlike BigTable, Cassandra extends the concept of the column family to the super column family: a column family of column families. On each Cassandra node, every column family corresponds to a MemTable resident in memory. When a node writes data, it first writes into the MemTable. On appropriate occasions, e.g., when the memory occupied by the MemTable reaches an upper bound or a fixed amount of time has elapsed, the MemTable is flushed to a corresponding SSTable on disk. SSTables achieve high write throughput because they are written sequentially. The system builds a local index for every block of data written to disk, and Cassandra keeps a compressed form of this index in memory as a Bloom filter [4]. Because the compressed index omits the block's position within the file system, Cassandra does not perform well for random reads.
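The write path described above can be modeled in a few lines. This is a simplified sketch under assumed names (BloomFilter, ColumnFamilyStore, and a flush threshold counted in entries rather than bytes), not Cassandra's internals: writes are absorbed by an in-memory MemTable, flushed as a sorted immutable SSTable, and each SSTable carries a Bloom filter as its compressed membership index.

import hashlib

class BloomFilter:
    """Compact index that answers 'definitely absent' or 'maybe present'."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

class ColumnFamilyStore:
    """One MemTable per column family; flushes to immutable sorted SSTables."""

    MEMTABLE_LIMIT = 4                # flush threshold (entries, for the demo)

    def __init__(self):
        self.memtable = {}            # mutable, in memory, absorbs all writes
        self.sstables = []            # list of (bloom filter, sorted rows);
                                      # on disk in a real system

    def write(self, key, value):
        self.memtable[key] = value    # writes always hit memory first
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        # Rows are sorted and appended sequentially: the source of the
        # SSTable's high write throughput.
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest SSTable first; Bloom filters skip tables lacking the key.
        for bloom, rows in reversed(self.sstables):
            if bloom.might_contain(key):
                for k, v in rows:
                    if k == key:
                        return v
        return None

store = ColumnFamilyStore()
for i in range(6):
    store.write(f"user:{i}", f"profile-{i}")
print(store.read("user:1"))           # served from a flushed SSTable
print(store.read("user:5"))           # still in the MemTable

Note how a read uses the Bloom filters only to skip SSTables that certainly lack the key; since the filter says nothing about where a present key lives, each candidate table must still be searched, mirroring the random-read weakness noted above.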
4.3 Storage Mechanism for Big Data
Considerable research on big data has promoted the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (a) file systems, (b) databases, and (c) programming models.
File systems are the foundation of the applications at the upper levels. Google's GFS is a scalable distributed file system built to support large-scale, distributed, data-intensive applications [5]. GFS uses cheap commodity servers to achieve fault tolerance and to provide customers with high-performance services. GFS suits large-file applications in which reads are far more frequent than writes. However, GFS also has limitations, such as a single point of failure and poor performance for small files. These limitations have been overcome by Colossus [6], the successor of GFS.
In addition, other companies and researchers have developed their own solutions to meet the different demands of big data storage. For example, HDFS and Kosmosfs are open-source derivatives inspired by GFS. Microsoft developed Cosmos [7] to support its search and advertisement business. Facebook utilizes Haystack [8] to store its large volume of small photo files.