Cost-effective data reliability assurance for data maintenance - Reliability Assurance of Big Data in the Cloud

expected storage duration of the data file in one go, and returns the checking interval

values set as the result (lines 3-11).

Besides data storage with a variable disk failure rate, the algorithm is also applicable when the disk failure rate is constant (e.g., virtual disks located over the virtual layer of the Cloud could apply such a reliability model). In that case, the minimum replication algorithm is significantly simplified, as the steps of calculating the average failure rate (line 1) and obtaining the piecewise functions (lines 5-6) can be omitted. Equation (6.1) only needs to be solved once, and the checking interval obtained does not change unless a replica of the data file is lost and the corresponding disk is replaced.

6.4.2 Metadata distribution algorithm

To manage the large number of data files in the Cloud, PRCR must have a practically sufficient capacity. Meanwhile, to fully use the capacity of PRCR, the utilization of PRCR nodes must be maximized. To address this issue, we propose our metadata distribution algorithm. The algorithm has two purposes. First, it maximizes the utilization of PRCR, so that the running cost of PRCR for maintaining each data file is minimized. Second, it distributes the metadata of data files to the appropriate PRCR nodes, so that a sufficient data reliability assurance $RA_k(1)$ can be provided to meet the data reliability requirement.

6.4.2.1 The maximum capacity of PRCR

The maximum capacity of PRCR stands for the maximum number of data files that PRCR is able to manage. In PRCR, the main component for replica management is the PRCR node. As mentioned in Section 6.2, PRCR may contain multiple PRCR nodes. Therefore, the maximum capacity of PRCR is the sum of the maximum capacities of all PRCR nodes. The maximum capacity of each PRCR node is determined by two parameters, which are the metadata scanning time and the scan cycle of the PRCR node. Note that the metadata scanning time is the time taken for scanning the metadata of a data file in the data table. The maximum capacity of PRCR can be presented by equation (6.2). In the equation, $C$ indicates the maximum capacity of PRCR, $T_{cycle}^{i}$ is the scan cycle of PRCR node $i$, $T_{scan}^{i}$ is the metadata scanning time of PRCR node $i$ and $N$ is the number of PRCR nodes in PRCR.

$$C = \sum_{i=1}^{N} \frac{T_{cycle}^{i}}{T_{scan}^{i}} \qquad (6.2)$$
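As an illustration of equation (6.2), the following minimal Python sketch sums the per-node ratio of scan cycle to metadata scanning time. The class name `PRCRNode`, its field names, and the choice of seconds as the time unit are assumptions made for this example and do not come from the PRCR design itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PRCRNode:
    """Hypothetical description of one PRCR node (names are assumptions)."""
    scan_cycle: float  # scan cycle of the node, in seconds
    scan_time: float   # time to scan the metadata of one data file, in seconds


def max_capacity(nodes: Iterable[PRCRNode]) -> float:
    """Evaluate equation (6.2): sum over nodes of scan_cycle / scan_time."""
    return sum(node.scan_cycle / node.scan_time for node in nodes)


if __name__ == "__main__":
    # Illustrative values only: a daily and an hourly scan cycle,
    # each node needing 0.5 s to scan one file's metadata.
    nodes = [PRCRNode(scan_cycle=86400.0, scan_time=0.5),
             PRCRNode(scan_cycle=3600.0, scan_time=0.5)]
    print(max_capacity(nodes))
```

The sketch keeps the ratio as a real number to match the equation. A practical implementation would probably round each node's capacity down to a whole number of data files.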

6.4.2.2 Provision of sufficient data reliability assurance

Although the maximum capacity of PRCR nodes can be calculated as just mentioned, in order to provide sufficient data reliability assurance to the data files, the scan cycle of the PRCR node must be no longer than the checking interval values of the data files.
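The constraint above can be sketched as a simple eligibility check. The Python example below is not the metadata distribution algorithm of this section: it is a hypothetical greedy rule, and the dictionary keys `scan_cycle`, `capacity` and `used` are assumptions chosen for illustration. It selects, among nodes that still have free capacity and whose scan cycle does not exceed the file's checking interval, the node with the longest scan cycle.

```python
from typing import Optional


def pick_node(nodes: list[dict], checking_interval: float) -> Optional[dict]:
    """Return an eligible node for a data file, or None if there is none.

    A node is eligible when its scan cycle is no longer than the file's
    checking interval and it still has free capacity. Each node is a dict
    with the hypothetical keys 'scan_cycle', 'capacity' and 'used'.
    """
    eligible = [n for n in nodes
                if n["scan_cycle"] <= checking_interval
                and n["used"] < n["capacity"]]
    if not eligible:
        return None
    # Assumed heuristic: prefer the longest eligible scan cycle, keeping
    # short-cycle nodes free for files with short checking intervals.
    return max(eligible, key=lambda n: n["scan_cycle"])


if __name__ == "__main__":
    nodes = [{"scan_cycle": 10.0, "capacity": 2, "used": 0},
             {"scan_cycle": 30.0, "capacity": 2, "used": 0}]
    chosen = pick_node(nodes, checking_interval=20.0)
    if chosen is not None:
        chosen["used"] += 1  # record that the file's metadata is placed here
    print(chosen)
```

A file whose checking interval is shorter than every node's scan cycle gets `None`, which signals that no existing node can provide sufficient assurance for it.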
