Among these data reliability models, some [55,57] are based on simple permutations and combinations to analyze the probability of data loss, while others [4,19,60] are based on more complicated Markov chains to analyze changes in the data redundancy level. In one study [55], the data reliability of the system was measured by the data missing rate and file missing rate, and the issue of maximizing data reliability with limited storage capacity was investigated. In another study [57], the researchers proposed an analytical replication model for determining the optimal number of replica servers, catalog servers, and catalog sizes to guarantee a given overall data reliability. Other research studies [4,19,60] investigated different aspects of similar scenarios. One study [4] investigated how to dynamically maintain a certain replication level in a large-scale data storage system by gradually creating new replicas. Another study [60] proposed an analytical framework to reason about and quantify the impact of the replica placement policy on system reliability. A third study [19] investigated the issue of maintaining a long-running distributed system using solely data replication. The similarity of these three studies is that they all assume a relatively high replication level (N replicas/bricks/data blocks) in a large-scale data storage system environment, while replicas are gradually created when needed.
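To make the combinatorial style of analysis concrete, the following sketch computes the probability that a single data block loses all of its replicas when a given number of nodes fail simultaneously, assuming the replicas are placed on distinct, randomly chosen nodes. It is a minimal illustration only; the node, replica, and failure counts are hypothetical, and the formula is not taken from [55] or [57].

```python
from math import comb

def prob_block_loss(total_nodes: int, replicas: int, failed_nodes: int) -> float:
    """Probability that a block with `replicas` copies on distinct random nodes
    loses all of them when `failed_nodes` of `total_nodes` nodes fail.

    The block is lost only if every replica node is among the failed ones,
    so we count the failure sets that cover all replica nodes.
    """
    if failed_nodes < replicas:
        return 0.0
    # remaining failed nodes are chosen from the non-replica nodes,
    # divided by all ways of choosing the failed nodes
    return comb(total_nodes - replicas, failed_nodes - replicas) / comb(total_nodes, failed_nodes)

if __name__ == "__main__":
    # hypothetical figures: 1000 nodes, 10 simultaneous failures
    for r in (1, 2, 3):
        print(f"{r} replica(s): P(block loss) = {prob_block_loss(1000, r, 10):.3e}")
```

As expected, each additional replica reduces the loss probability by roughly another order of magnitude under these assumptions, which is why a fixed replication level of three is so common in practice.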
In Cloud computing, data replication technologies have also been widely adopted in current commercial Cloud systems; typical examples include Amazon Simple Storage Service (Amazon S3) [10], GFS [9], HDFS [8], and so forth. Although data replication is widely used, it has a side effect: it consumes considerable extra storage resources and incurs significant additional cost. To address this issue, Amazon S3 introduced its Reduced Redundancy Storage (RRS) solution to reduce the storage cost [10]. However, this cost reduction is achieved by sacrificing data reliability; with RRS, only a lower level of data reliability can be ensured.
Some of our previous works have contributed to reducing storage cost in the Cloud based on data replication. For example, in one of our studies [61], we proposed a cost-effective dynamic data replication strategy for data reliability in Cloud data centers, in which an incremental replication method is applied to reduce the average replica number while meeting the data reliability requirement. However, for long-term storage or storage with a very high reliability requirement, this strategy could still generate more than three replicas for the data, so its ability to reduce storage cost is limited.
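As a rough illustration of the incremental idea, the sketch below computes the smallest replica count whose predicted reliability over a given storage duration meets the requirement; under an incremental strategy, further replicas would be created only as the remaining storage duration demands them. The exponential reliability model, the 2% annual failure rate, and the 99.99% target are illustrative assumptions, not the model or figures used in [61].

```python
import math

def replica_reliability(num_replicas: int, failure_rate: float, duration: float) -> float:
    """Reliability of a block with `num_replicas` independent copies over
    `duration` (in years), assuming each copy survives with probability
    exp(-failure_rate * duration) -- an illustrative exponential model."""
    p_copy_loss = 1.0 - math.exp(-failure_rate * duration)
    return 1.0 - p_copy_loss ** num_replicas

def replicas_needed(requirement: float, failure_rate: float, duration: float,
                    max_replicas: int = 10) -> int:
    """Smallest replica count whose predicted reliability meets `requirement`."""
    for n in range(1, max_replicas + 1):
        if replica_reliability(n, failure_rate, duration) >= requirement:
            return n
    return max_replicas

if __name__ == "__main__":
    # hypothetical parameters: 2% annual failure rate, 99.99% reliability target
    rate, target = 0.02, 0.9999
    for years in (0.5, 1, 5, 20):
        print(f"{years:>4} year(s): {replicas_needed(target, rate, years)} replica(s) needed")
```

Under these illustrative parameters, short storage durations can be covered by two replicas, while very long durations or stricter targets push the required replica count well beyond three, which mirrors the limitation noted above.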
2.2.2 Erasure coding for data reliability
Besides data replication, another type of data storage approach leverages erasure coding techniques to increase the data redundancy level so as to meet the data reliability assurance goal. Currently, distributed storage systems with erasure coding-based storage schemes include OceanStore [6], Ivy [62], Windows Azure [5], and so forth.
Erasure coding is a coding approach that reorganizes the original information into another form. In information-theoretic terms, it applies a mathematical transformation, typically described as polynomial interpolation or oversampling, that turns a message of k symbols into a longer message (code word) of n symbols such that the original message can be recovered from a subset of the n symbols [63]. By transforming the message in this way, m = n - k redundant symbols are added, providing protection against the loss of part of the data.
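To make the (n, k) idea concrete, the following sketch encodes k data symbols as the coefficients of a polynomial over a small prime field, stores the polynomial's values at n distinct points (the oversampling step), and recovers the original symbols from any k surviving values by Lagrange interpolation. The prime field and the tiny symbol values are illustrative choices; production systems use optimized codes such as Reed-Solomon over GF(2^8).

```python
PRIME = 257  # small prime field for illustration

def poly_eval(coeffs, x):
    """Evaluate a polynomial (coefficients in increasing degree) at x mod PRIME."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % PRIME
    return result

def encode(data, n):
    """Treat the k data symbols as polynomial coefficients and evaluate the
    polynomial at points 1..n, producing n code-word symbols (n >= k)."""
    return [(x, poly_eval(data, x)) for x in range(1, n + 1)]

def decode(shares, k):
    """Recover the k original symbols from any k (x, y) pairs by Lagrange
    interpolation of the degree-(k-1) polynomial, read off as coefficients."""
    xs = [x for x, _ in shares[:k]]
    ys = [y for _, y in shares[:k]]
    coeffs = [0] * k
    for i in range(k):
        # build the basis polynomial that is 1 at xs[i] and 0 at the other points
        basis, denom = [1], 1
        for j in range(k):
            if j == i:
                continue
            # multiply basis by (x - xs[j])
            new = [0] * (len(basis) + 1)
            for d, c in enumerate(basis):
                new[d] = (new[d] - c * xs[j]) % PRIME
                new[d + 1] = (new[d + 1] + c) % PRIME
            basis = new
            denom = (denom * (xs[i] - xs[j])) % PRIME
        scale = ys[i] * pow(denom, -1, PRIME) % PRIME
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + c * scale) % PRIME
    return coeffs

if __name__ == "__main__":
    message = [72, 105, 33]                               # k = 3 symbols
    codeword = encode(message, n=5)                       # n = 5 symbols stored
    survivors = [codeword[1], codeword[4], codeword[2]]   # any 3 of the 5 suffice
    assert decode(survivors, k=3) == message
    print("recovered:", decode(survivors, k=3))
```

In this toy (5, 3) scheme, two of the five stored symbols can be lost without losing the message, at a storage overhead of only 5/3, compared with the 3x overhead of three-way replication for the same tolerance of two failures.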
 