Introduction - Reliability Assurance of Big Data in the Cloud


current distributed data storage/management systems in both industry and academia, which include examples such as OceanStore [6], DataGrid [7], Hadoop Distributed File System [8], Google File System [9], Amazon S3 [10], and so forth. In these storage systems, several replicas are created for each piece of data. These replicas are stored on different storage devices, so that the data have a better chance of surviving when storage device failures occur.
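
The intuition behind replication can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that each storage device fails independently and with the same probability over some period. Neither this independence assumption nor the example numbers come from the text, and `survival_probability` is a hypothetical helper name.

```python
def survival_probability(device_failure_prob: float, replicas: int) -> float:
    """Probability that at least one replica of a piece of data survives.

    Illustrative assumption: device failures are independent and share the
    same failure probability. Real storage failures are often correlated.
    """
    if not 0.0 <= device_failure_prob <= 1.0:
        raise ValueError("device_failure_prob must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The data is lost only if every device holding a replica fails.
    return 1.0 - device_failure_prob ** replicas


if __name__ == "__main__":
    # Hypothetical per-device failure probability, chosen only for the demo.
    for n in (1, 2, 3):
        print(n, survival_probability(0.05, n))
```

Under these assumptions, each additional replica multiplies the probability of losing the data by the per-device failure probability, which is why storing replicas on different devices improves the chance of survival.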

In recent years, Cloud computing has emerged as the latest distributed computing paradigm, providing redundant, inexpensive, and scalable resources in a pay-as-you-go fashion to meet various application requirements [11]. Since its advent in late 2007 [12], Cloud computing has quickly become one of the most promising distributed solutions in both industry and academia. With this rapid growth, the size of Cloud storage is expanding at a dramatic speed. It is estimated that by 2015 the data stored in the Cloud will reach 0.8 ZB (i.e., 0.8 × 10^21 bytes or 800,000,000 TB), while more data are “touched” by the Cloud within their life cycles [13]. To maintain such a large amount of Cloud data, data reliability in the Cloud is considered more important than ever before. However, due to the accelerating growth of Cloud data, current replication-based data reliability management has become a bottleneck for the development of Cloud data storage. For example, storage systems such as Amazon S3, Google File System, and Hadoop Distributed File System all adopt similar data replication strategies, called the “conventional multi-replica replication strategy,” in which a fixed number of replicas (normally three) are stored for all data to meet the reliability requirement. When storing huge amounts of Cloud data, these conventional multi-replica replication strategies consume a large amount of storage resources for the additional replicas. This could have negative effects for both Cloud storage providers and users. On the one hand, from the Cloud storage provider's perspective, the excessive consumption of storage resources leads to a large storage overhead and increases the cost of providing the storage service. On the other hand, from the Cloud storage user's perspective, under the pay-as-you-go pricing model, the excessive storage resource usage will ultimately be paid for by the storage users. For data-intensive Cloud applications in particular, the excessive storage cost incurred could be huge. Therefore, Cloud-based applications have put forward a higher demand for cost-effective management of Cloud storage. While the data reliability requirement must be met first, data in the Cloud also needs to be stored in a highly cost-effective manner.
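
The storage overhead of a fixed-replica strategy can be made concrete with simple arithmetic. The following is a minimal sketch, not a model taken from the text: the function names are hypothetical, decimal units are assumed (1 TB = 10^12 bytes, 1 ZB = 10^21 bytes), and the data size in the demo is an arbitrary example.

```python
TB_PER_ZB = 10**9  # 10**21 bytes per ZB divided by 10**12 bytes per TB


def zb_to_tb(zettabytes: float) -> float:
    """Convert zettabytes to terabytes using decimal units."""
    return zettabytes * TB_PER_ZB


def raw_storage_tb(logical_tb: float, replicas: int) -> float:
    """Raw capacity consumed when every piece of data keeps `replicas` copies."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return logical_tb * replicas


def extra_storage_tb(logical_tb: float, replicas: int) -> float:
    """Capacity spent on the additional replicas beyond the first copy."""
    return raw_storage_tb(logical_tb, replicas) - logical_tb


if __name__ == "__main__":
    print(zb_to_tb(0.8))                # the 0.8 ZB estimate expressed in TB
    logical = 100.0                     # hypothetical data set size in TB
    print(raw_storage_tb(logical, 3))   # capacity used with three replicas
    print(extra_storage_tb(logical, 3)) # capacity used by additional replicas
```

With three replicas, two thirds of the raw capacity holds additional copies. Under a pay-as-you-go model where cost scales with stored volume, that share is what a lower-overhead strategy would aim to reduce while still meeting the reliability requirement.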

1.2 Background of Cloud storage

In this section, we briefly introduce the background of Cloud storage. First, we introduce the distinctive features of Cloud storage systems. Second, we introduce the Cloud data life cycle.

1.2.1 Distinctive features of Cloud storage systems

Data reliability is closely related to the structure of the storage system and how the storage system is being used. Different from other distributed storage systems, the
