Cost-effective data reliability assurance for data maintenance - Reliability Assurance of Big Data in the Cloud


Figure 6.4 Pseudo code of metadata distribution algorithm

In addition to the algorithm, several issues in distributing metadata need to be further addressed.

• First, the capacity of each PRCR node is limited: as more and more data files are managed by PRCR, the capacity of its nodes could gradually run out. Because PRCR nodes are independent of one another, the organization of PRCR is highly elastic. When a PRCR node reaches, or is about to reach, its maximum capacity, a new PRCR node is created. The scan cycle of the new node can be set to that of the fully occupied node; the choice should be made according to the data management requirement.

• Second, the data reliability model with a variable disk failure rate has the side effect that each data file has multiple checking interval values; that is, the checking interval changes over time. Once the checking interval increases to a threshold equal to the scan cycle of another PRCR node, the current metadata distribution becomes sub-optimal. Several solutions could be applied. For example, the scan cycles of PRCR nodes can be organized so that each data file is managed by a PRCR node whose scan cycle is smaller than every checking interval value that the file could have. Alternatively, if the metadata of data files must be redistributed regardless, the redistribution could be conducted in batch mode to reduce its impact and computation overhead.

• Third, the metadata are distributed according to the calculation of the minimum replication algorithm. However, the predicted storage duration could differ from that of the disks in reality, and hence prediction errors could occur. Such errors are most likely caused by deviation of the disk failure rates. The only type of error that could jeopardize data reliability is underestimation of the disk failure rates, which causes the checking interval to be overestimated. In general, prediction errors are very similar to the second issue, so the solutions for the second issue also apply to them. In addition, the disk failure rates can be adjusted using statistics collected on the disks, and so forth.
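The first two issues can be illustrated with a minimal Python sketch. It is not the book's algorithm: the names PRCRNode, best_cycle, assign and redistribute_batch are hypothetical, capacity is counted as a number of metadata entries, and a node is treated as eligible when its scan cycle does not exceed the smallest checking interval a file may have. Picking the largest eligible scan cycle is also an assumption, made here so that files are not scanned more often than needed.

```python
from dataclasses import dataclass, field


@dataclass
class PRCRNode:
    # Hypothetical model of one PRCR node (not taken from the book).
    scan_cycle: float   # time to scan all managed metadata once (assumed unit: days)
    capacity: int       # maximum number of metadata entries (assumed)
    files: dict = field(default_factory=dict)  # file_id -> smallest checking interval


def best_cycle(nodes, intervals):
    """Largest existing scan cycle that does not exceed any checking interval."""
    bound = min(intervals)
    eligible = [n.scan_cycle for n in nodes if n.scan_cycle <= bound]
    if not eligible:
        raise ValueError("no PRCR node scans often enough for this file")
    return max(eligible)


def assign(nodes, file_id, intervals):
    """Place a file's metadata; spill over to a new node when capacity runs out."""
    cycle = best_cycle(nodes, intervals)
    same_cycle = [n for n in nodes if n.scan_cycle == cycle]
    target = next((n for n in same_cycle if len(n.files) < n.capacity), None)
    if target is None:
        # Every node with this scan cycle is full: create a new, independent
        # node that reuses the scan cycle of the fully occupied one.
        target = PRCRNode(cycle, same_cycle[0].capacity)
        nodes.append(target)
    target.files[file_id] = min(intervals)
    return target


def redistribute_batch(nodes, pending):
    """Apply many checking-interval updates in one pass; return files moved."""
    moved = 0
    for file_id, intervals in pending.items():
        current = next(n for n in nodes if file_id in n.files)
        if current.scan_cycle == best_cycle(nodes, intervals):
            current.files[file_id] = min(intervals)  # placement is still suitable
            continue
        del current.files[file_id]
        assign(nodes, file_id, intervals)
        moved += 1
    return moved
```

With two nodes whose scan cycles are 1 and 7 days, a file whose checking intervals are 8 and 10 days would be placed on the 7-day node. Once that node is full, the next such file triggers the creation of a second 7-day node. Passing all changed intervals to redistribute_batch at once moves only the files whose current node no longer has the most suitable scan cycle.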

6.5 Evaluation of PRCR

Based on the results of several experiments conducted on both a local computer and Amazon Web Services (AWS), in this section we evaluate PRCR from the aspects of performance and cost-effectiveness.
