4.2 Distributed Storage System
The first challenge brought about by big data is how to develop a large-scale
distributed storage system for strategic preservation of data and efficient data
processing and analysis. To use a distributed system to store massive data, the
following factors should be taken into consideration:
• Consistency: a distributed storage system requires multiple servers to
cooperatively store data. As the number of servers grows, so does the
probability of server failure. Usually data is divided into multiple pieces
that are stored, in multiple copies, at different servers to ensure
availability in case of server failure. However, server failures and parallel
storage may cause inconsistency among different copies of the same data.
Consistency refers to assuring that multiple copies of the same data are
identical.
• Availability: a distributed storage system operates across multiple sets of
servers. As more servers are used, server failures become inevitable. It is
desirable that the entire system not be seriously affected in serving read and
write requests from client terminals. This property is called availability.
• Partition Tolerance: multiple servers in a distributed storage system are
connected by a network. The network could have link/node failures or temporary
congestion. The distributed system should have a certain level of tolerance to
problems caused by network failures. It would be desirable that the distributed
storage still works well when the network is partitioned.
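The consistency problem described above can be illustrated with a minimal sketch. It assumes a toy model in which each server is an in-memory dictionary; the names Replica, write_all, and is_consistent are hypothetical and chosen only for illustration. A write issued while one server is down leaves the copies of the same data different once that server returns.

```python
from typing import Dict, List


class Replica:
    """One server holding a copy of the data (hypothetical toy model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True                    # False simulates a server failure
        self.store: Dict[str, str] = {}   # this server's copy of the data


def write_all(replicas: List[Replica], key: str, value: str) -> int:
    """Write to every reachable replica; return how many accepted the write."""
    acked = 0
    for replica in replicas:
        if replica.up:
            replica.store[key] = value
            acked += 1
    return acked


def is_consistent(replicas: List[Replica], key: str) -> bool:
    """Consistency here means all copies of `key` are identical."""
    values = {replica.store.get(key) for replica in replicas}
    return len(values) <= 1


replicas = [Replica("s1"), Replica("s2"), Replica("s3")]

write_all(replicas, "k", "v1")
consistent_before = is_consistent(replicas, "k")   # all three copies hold "v1"

replicas[2].up = False                  # s3 fails
acked = write_all(replicas, "k", "v2")  # only s1 and s2 receive the update
replicas[2].up = True                   # s3 recovers with a stale copy
consistent_after = is_consistent(replicas, "k")
```

In this sketch the system stays available during the failure (the write is accepted by two servers), but the recovered server still holds the old value, so the copies are no longer identical until some repair step is run.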
Eric Brewer proposed the CAP theorem [1, 2] in 2000, which states that a
distributed system cannot simultaneously meet the requirements on consistency,
availability, and partition tolerance; at most two of the three requirements can
be satisfied simultaneously. Seth Gilbert and Nancy Lynch of MIT proved the
CAP theorem in 2002. Since consistency, availability, and partition
tolerance cannot all be achieved simultaneously, we can have a CA system by
ignoring partition tolerance, a CP system by ignoring availability, or an AP
system by ignoring consistency, according to different design goals. The three
kinds of systems are discussed in the following.
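The CP and AP trade-off can be sketched in a few lines. This is a toy model under stated assumptions, not a description of any real storage system: the Cluster class, its two nodes, and the "CP"/"AP" mode flag are hypothetical names introduced only for illustration. When the network is partitioned, the CP cluster rejects a write it cannot replicate, while the AP cluster accepts it locally and lets the copies diverge.

```python
class Node:
    """A server holding one copy of a single value (hypothetical toy model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = "v0"


class Cluster:
    """Two replicated nodes that follow either a CP or an AP policy."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Node("a")
        self.b = Node("b")
        self.partitioned = False  # True simulates a network partition

    def write(self, node_name: str, value: str) -> bool:
        """Try to write through the named node; return True if accepted."""
        node = self.a if node_name == "a" else self.b
        peer = self.b if node is self.a else self.a
        if not self.partitioned:
            # Healthy network: update both copies.
            node.value = value
            peer.value = value
            return True
        if self.mode == "CP":
            # Cannot reach the peer: refuse, giving up availability.
            return False
        # AP: accept locally, giving up consistency between the copies.
        node.value = value
        return True


cp = Cluster("CP")
cp.partitioned = True
cp_accepted = cp.write("a", "v1")
cp_consistent = cp.a.value == cp.b.value

ap = Cluster("AP")
ap.partitioned = True
ap_accepted = ap.write("a", "v1")
ap_consistent = ap.a.value == ap.b.value
```

Under a partition, the CP cluster in this sketch keeps both copies identical but does not serve the write, whereas the AP cluster serves the write and ends up with two different copies.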
CA systems do not have partition tolerance, i.e., they cannot handle network
failures. Therefore, CA systems are generally regarded as storage systems with a
single server, such as traditional small-scale relational databases. Such systems
keep a single copy of data, so consistency is easily ensured. Availability
is guaranteed by the excellent design of relational databases. However, since CA
systems cannot handle network failures, they cannot be expanded to use
many servers. This is why most large-scale storage systems are CP systems or
AP systems.
Compared with CA systems, CP systems ensure partition tolerance. Therefore,
CP systems can be expanded to become distributed systems. CP systems generally
maintain several copies of the same data in order to ensure a level of fault tolerance.
CP systems also ensure data consistency, i.e., multiple copies of the same data