Among these data reliability models, some [55,57] are based on simple permutations and combinations to analyze the probability of data loss, while others [4,19,60] are based on more complicated Markov chains to analyze changes in the data redundancy level. In one study [55], the data reliability of the system was measured by the data missing rate and file missing rate, and the issue of maximizing data reliability with limited storage capacity was investigated. In another study [57], the researchers proposed an analytical replication model for determining the optimal number of replica servers, catalog servers, and catalog sizes to guarantee a given overall data reliability. Other research studies [4,19,60] investigated different aspects of similar scenarios. One study [4] investigated how to dynamically maintain a certain replication level in a large-scale data storage system by gradually creating new replicas. Another study [60] proposed an analytical framework to reason about and quantify the impact of the replica placement policy on system reliability. A third study [19] investigated the issue of maintaining a long-running distributed system using solely data replication. The similarity of these three studies is that they all assume a relatively high replication level (N replicas/bricks/data blocks) in a large-scale data storage system environment, while replicas are gradually created when needed.
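To make the combinatorial style of analysis concrete, the following sketch computes the probability that a single data block loses all of its replicas when a given number of nodes fail simultaneously, assuming the replicas are placed on distinct, randomly chosen nodes. It is a minimal illustration only; the node, replica, and failure counts are hypothetical, and the formula is not taken from [55] or [57].

```python
from math import comb

def prob_block_loss(total_nodes: int, replicas: int, failed_nodes: int) -> float:
    """Probability that a block with `replicas` copies on distinct random nodes
    loses all of them when `failed_nodes` of `total_nodes` nodes fail.

    The block is lost only if every replica node is among the failed ones,
    so we count the failure sets that cover all replica nodes.
    """
    if failed_nodes < replicas:
        return 0.0
    # remaining failed nodes are chosen from the non-replica nodes,
    # divided by all ways of choosing the failed nodes
    return comb(total_nodes - replicas, failed_nodes - replicas) / comb(total_nodes, failed_nodes)

if __name__ == "__main__":
    # hypothetical figures: 1000 nodes, 10 simultaneous failures
    for r in (1, 2, 3):
        print(f"{r} replica(s): P(block loss) = {prob_block_loss(1000, r, 10):.3e}")
```

As expected, each additional replica reduces the loss probability by roughly another order of magnitude under these assumptions, which is why a fixed replication level of three is so common in practice.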
In Cloud computing, data replication technologies have also been widely adopted in current commercial Cloud systems; typical examples include Amazon Simple Storage Service (Amazon S3) [10], GFS [9], HDFS [8], and so forth. Although data replication is widely used, it has a side effect: it consumes considerable extra storage resources and incurs significant additional cost. To address this issue, Amazon S3 introduced its Reduced Redundancy Storage (RRS) solution to reduce the storage cost [10]. However, this cost reduction is achieved by sacrificing data reliability; with RRS, only a lower level of data reliability can be ensured.
Some of our previous works have contributed to reducing storage cost in the Cloud based on data replication. For example, in one of our studies [61], we proposed a cost-effective dynamic data replication strategy for data reliability in Cloud data centers, in which an incremental replication method is applied to reduce the average replica number while meeting the data reliability requirement. However, for long-term storage or storage with a very high reliability requirement, this strategy could still generate more than three replicas for the data, so its ability to reduce storage cost is limited.
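As a rough illustration of the incremental idea, the sketch below computes the smallest replica count whose predicted reliability over a given storage duration meets the requirement; under an incremental strategy, further replicas would be created only as the remaining storage duration demands them. The exponential reliability model, the 2% annual failure rate, and the 99.99% target are illustrative assumptions, not the model or figures used in [61].

```python
import math

def replica_reliability(num_replicas: int, failure_rate: float, duration: float) -> float:
    """Reliability of a block with `num_replicas` independent copies over
    `duration` (in years), assuming each copy survives with probability
    exp(-failure_rate * duration) -- an illustrative exponential model."""
    p_copy_loss = 1.0 - math.exp(-failure_rate * duration)
    return 1.0 - p_copy_loss ** num_replicas

def replicas_needed(requirement: float, failure_rate: float, duration: float,
                    max_replicas: int = 10) -> int:
    """Smallest replica count whose predicted reliability meets `requirement`."""
    for n in range(1, max_replicas + 1):
        if replica_reliability(n, failure_rate, duration) >= requirement:
            return n
    return max_replicas

if __name__ == "__main__":
    # hypothetical parameters: 2% annual failure rate, 99.99% reliability target
    rate, target = 0.02, 0.9999
    for years in (0.5, 1, 5, 20):
        print(f"{years:>4} year(s): {replicas_needed(target, rate, years)} replica(s) needed")
```

Under these illustrative parameters, short storage durations can be covered by two replicas, while very long durations or stricter targets push the required replica count well beyond three, which mirrors the limitation noted above.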
2.2.2 Erasure coding for data reliability
Besides data replication, another type of data storage approach leverages erasure coding techniques to increase the data redundancy level so as to meet the data reliability assurance goal. Currently, distributed storage systems with erasure coding-based storage schemes include OceanStore [6], Ivy [62], Windows Azure [5], and so forth.
Erasure coding is a coding approach that reorganizes the original information into another form. In information-theoretic terms, it applies a mathematical transformation, typically described as polynomial interpolation or oversampling, that turns a message of k symbols into a longer message (code word) of n symbols such that the original message can be recovered from a subset of the n symbols [63]. By transforming the message in this way, m = n - k redundant symbols are added, providing protection against the loss of part of the data.
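To make the (n, k) idea concrete, the following sketch encodes k data symbols as the coefficients of a polynomial over a small prime field, stores the polynomial's values at n distinct points (the oversampling step), and recovers the original symbols from any k surviving values by Lagrange interpolation. The prime field and the tiny symbol values are illustrative choices; production systems use optimized codes such as Reed-Solomon over GF(2^8).

```python
PRIME = 257  # small prime field for illustration

def poly_eval(coeffs, x):
    """Evaluate a polynomial (coefficients in increasing degree) at x mod PRIME."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % PRIME
    return result

def encode(data, n):
    """Treat the k data symbols as polynomial coefficients and evaluate the
    polynomial at points 1..n, producing n code-word symbols (n >= k)."""
    return [(x, poly_eval(data, x)) for x in range(1, n + 1)]

def decode(shares, k):
    """Recover the k original symbols from any k (x, y) pairs by Lagrange
    interpolation of the degree-(k-1) polynomial, read off as coefficients."""
    xs = [x for x, _ in shares[:k]]
    ys = [y for _, y in shares[:k]]
    coeffs = [0] * k
    for i in range(k):
        # build the basis polynomial that is 1 at xs[i] and 0 at the other points
        basis, denom = [1], 1
        for j in range(k):
            if j == i:
                continue
            # multiply basis by (x - xs[j])
            new = [0] * (len(basis) + 1)
            for d, c in enumerate(basis):
                new[d] = (new[d] - c * xs[j]) % PRIME
                new[d + 1] = (new[d + 1] + c) % PRIME
            basis = new
            denom = (denom * (xs[i] - xs[j])) % PRIME
        scale = ys[i] * pow(denom, -1, PRIME) % PRIME
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + c * scale) % PRIME
    return coeffs

if __name__ == "__main__":
    message = [72, 105, 33]                               # k = 3 symbols
    codeword = encode(message, n=5)                       # n = 5 symbols stored
    survivors = [codeword[1], codeword[4], codeword[2]]   # any 3 of the 5 suffice
    assert decode(survivors, k=3) == message
    print("recovered:", decode(survivors, k=3))
```

In this toy (5, 3) scheme, two of the five stored symbols can be lost without losing the message, at a storage overhead of only 5/3, compared with the 3x overhead of three-way replication for the same tolerance of two failures.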
 