Big Data Storage - Big Data: Related Technologies, Challenges and Future Prospects

4.2 Distributed Storage System

The first challenge brought about by big data is how to develop a large-scale
distributed storage system for the strategic preservation of data and for
efficient data processing and analysis. To use a distributed system to store
massive data, the following factors should be taken into consideration:

Consistency: a distributed storage system requires multiple servers to
cooperatively store data. As the number of servers grows, so does the
probability that some server fails. Data is therefore usually divided into
multiple pieces that are stored, with copies, at different servers to ensure
availability in case of server failure. However, server failures and parallel
storage may cause inconsistency among different copies of the same data.
Consistency refers to assuring that multiple copies of the same data are
identical.

Availability: a distributed storage system operates on multiple sets of
servers. As more servers are used, server failures become inevitable. It is
desirable that the system as a whole is not seriously affected in serving read
and write requests from client terminals when individual servers fail. This
property is called availability.

Partition Tolerance: multiple servers in a distributed storage system are
connected by a network. The network may suffer link or node failures or
temporary congestion. The distributed system should tolerate, to a certain
degree, the problems caused by network failures. It is desirable that the
distributed storage system still works well when the network is partitioned.
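
The remark in the factors above that failures become more likely as servers are
added can be made concrete with a short calculation. The sketch below is only
an illustration and is not from the original text: it assumes that each server
fails independently with the same probability p_fail during some period, so the
chance that at least one of n servers fails is 1 - (1 - p_fail)^n. The function
name and the example value of 0.01 are hypothetical.

```python
def prob_any_failure(n_servers: int, p_fail: float) -> float:
    """Probability that at least one of n_servers fails.

    Assumes (for illustration only) that servers fail independently,
    each with the same probability p_fail.
    """
    if n_servers < 0:
        raise ValueError("n_servers must be non-negative")
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    # All servers survive with probability (1 - p_fail) ** n_servers.
    return 1.0 - (1.0 - p_fail) ** n_servers


if __name__ == "__main__":
    # Hypothetical per-server failure probability of 1%.
    for n in (1, 10, 100, 1000):
        print(n, round(prob_any_failure(n, 0.01), 4))
```

Under this assumption the probability rises quickly with the number of servers,
which is why a large system has to treat server failure as a normal event
rather than an exception.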

Eric Brewer proposed the CAP theorem [1, 2] in 2000, which states that a
distributed system cannot simultaneously meet the requirements of consistency,
availability, and partition tolerance; at most two of the three can be
satisfied at once. Seth Gilbert and Nancy Lynch of MIT proved the theorem in
2002. Since consistency, availability, and partition tolerance cannot all be
achieved simultaneously, we can, according to different design goals, have a CA
system by ignoring partition tolerance, a CP system by ignoring availability,
or an AP system by ignoring consistency. The three kinds of systems are
discussed in the following.
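
The trade-off between a CP and an AP design can be illustrated with a toy
model. The sketch below is a deliberately simplified assumption, not a
description of any real system: it models two replicas joined by one network
link, and all names (Replica, TwoReplicaStore, partitioned) are hypothetical.
During a partition, the CP variant refuses writes so that the copies stay
identical, while the AP variant keeps accepting writes and lets the copies
diverge. Real CP systems are usually more refined, for example by letting a
majority of replicas continue to serve requests.

```python
class Replica:
    """One copy of the data, held as a plain dictionary."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class TwoReplicaStore:
    """Toy store with two replicas and a network link that can be cut."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True means a and b cannot communicate

    def write(self, key, value):
        """Write through replica a. Returns True if the write is accepted."""
        if not self.partitioned:
            # Healthy network: update both copies.
            self.a.data[key] = value
            self.b.data[key] = value
            return True
        if self.mode == "CP":
            # Cannot reach replica b, so reject: availability is given up.
            return False
        # AP: accept on the reachable replica only: consistency is given up.
        self.a.data[key] = value
        return True

    def consistent(self):
        """True if both replicas hold identical data."""
        return self.a.data == self.b.data


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        store = TwoReplicaStore(mode)
        store.write("x", 1)          # network healthy
        store.partitioned = True     # the link between a and b fails
        accepted = store.write("x", 2)
        print(mode, "accepted:", accepted, "consistent:", store.consistent())
```

In this model the CP store rejects the second write and stays consistent,
whereas the AP store accepts it and ends up with two different copies of x.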

CA systems do not have partition tolerance, i.e., they cannot handle network
failures. Therefore, CA systems are generally regarded as storage systems with
a single server, such as traditional small-scale relational databases. Such
systems keep a single copy of the data, so consistency is easily ensured.
Availability is ensured by the sound design of relational databases. However,
since CA systems cannot handle network failures, they cannot be expanded to use
many servers. This is why most large-scale storage systems are CP systems or
AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore,
CP systems can be expanded to become distributed systems. CP systems generally
maintain several copies of the same data in order to ensure a level of fault
tolerance. CP systems also ensure data consistency, i.e., multiple copies of
the same data
