Cost-effective data reliability assurance for data maintenance - Reliability Assurance of Big Data in the Cloud

Therefore, each data file should be managed by the PRCR node with a scan cycle of

the proper length. The scan cycle constraint of the PRCR node could lead to certain

underutilization of PRCR.

To maximize the utilization of PRCR while providing sufficient data reliability assurance to the data files, the metadata distribution algorithm distributes the metadata of each data file to the most appropriate PRCR node, according to the checking interval values of the data files and the scan cycles of the PRCR nodes. The principle of the algorithm is simple: it compares the checking interval of the data file with the scan cycle of each PRCR node. Among the PRCR nodes with a scan cycle smaller than the checking interval of the data file, the metadata are distributed to the node with the biggest scan cycle (or to a random one of them if several nodes share that scan cycle). The difference between the checking interval of the data file and the scan cycle of a PRCR node indicates how long before the checking interval is reached the proactive replica checking task is conducted. When this difference is minimized, metadata scanning and proactive replica checking are conducted on each data file as infrequently as its reliability requirement allows, so that the number of data files that a PRCR node is able to manage is maximized.
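The selection rule described above can be sketched in Python. This is a minimal sketch, not the book's own pseudocode: the function name `choose_prcr_node`, the dictionary mapping node ids to scan cycles, and the convention of returning `None` when no node qualifies are all assumptions made for illustration. The comparison is strict ("smaller than"), following the wording of the text.

```python
import random


def choose_prcr_node(ci, scan_cycles, rng=random):
    """Pick the PRCR node that should hold a data file's metadata.

    ci          -- minimum checking interval of the data file
    scan_cycles -- dict mapping a (hypothetical) node id to its scan cycle,
                   in the same time unit as ci
    Returns a node id, or None if no node has a scan cycle smaller than ci.
    """
    # Difference CI - ScanCycle for every node whose scan cycle is smaller
    # than the checking interval (the only nodes that scan often enough).
    diffs = {node: ci - cycle
             for node, cycle in scan_cycles.items()
             if cycle < ci}
    if not diffs:
        return None  # assumption: the caller handles files no node can serve

    # The smallest difference corresponds to the biggest eligible scan cycle.
    smallest = min(diffs.values())
    candidates = [node for node, d in diffs.items() if d == smallest]

    # Several nodes may share the same scan cycle; pick one of them at random.
    return rng.choice(candidates)


if __name__ == "__main__":
    nodes = {"A": 2, "B": 5, "C": 5, "D": 9}  # illustrative scan cycles
    print(choose_prcr_node(7, nodes))   # "B" or "C"
    print(choose_prcr_node(10, nodes))  # "D"
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible, which can be convenient when testing.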

The following presents the proof of the effectiveness of the metadata distribution

algorithm:

Theorem. Given multiple PRCR nodes with different scan cycles, the distribution of metadata following the metadata distribution algorithm maximizes the utilization of all the PRCR nodes.

Proof. Assume that all PRCR nodes reach their maximum capacity when all the metadata are distributed by following the metadata distribution algorithm. Then, for any data file f maintained by PRCR node A and any other PRCR node I with a scan cycle bigger than that of A, letting CI(f) be the minimum checking interval of data file f, we have $\mathit{ScanCycle}(A) \le \mathit{CI}(f) < \mathit{ScanCycle}(I)$. Without loss of generality, we create another metadata distribution, different from the current one, by swapping the metadata of a randomly chosen pair of data files. Assume two PRCR nodes B and C for which $\mathit{ScanCycle}(B) > \mathit{ScanCycle}(C)$, and assume that data files f1 and f2 are managed by PRCR node B and PRCR node C, respectively. Swap their managing PRCR nodes. Since $\mathit{CI}(f_2) < \mathit{ScanCycle}(B)$, the data reliability requirement of f2 cannot be met. Therefore, data file f2 cannot be managed by PRCR under the new metadata distribution, and the utilization of the PRCR nodes under this new distribution is lower than that obtained by following the metadata distribution algorithm. According to the preceding reasoning, it can be deduced that there is no other metadata distribution that has higher utilization. Hence, the theorem holds.
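The swap argument in the proof can be checked on a small numeric case. The sketch below uses hypothetical scan cycles and checking intervals (none of these values come from the source) and treats a file as manageable by a node when the node's scan cycle does not exceed the file's checking interval, which is the condition used in the proof.

```python
def can_manage(scan_cycle, ci):
    """A node can manage a file only if it scans at least once per checking interval."""
    return scan_cycle <= ci


# Hypothetical values, all in the same (unspecified) time unit.
scan_cycle = {"B": 10, "C": 4}   # ScanCycle(B) > ScanCycle(C)
ci = {"f1": 12, "f2": 6}         # ScanCycle(C) <= CI(f2) < ScanCycle(B)

before = {"f1": "B", "f2": "C"}  # distribution produced by the algorithm
after = {"f1": "C", "f2": "B"}   # the same files with managing nodes swapped


def managed_files(assignment):
    """Return the files whose reliability requirement is met under an assignment."""
    return [f for f, node in assignment.items()
            if can_manage(scan_cycle[node], ci[f])]


print(managed_files(before))  # ['f1', 'f2']
print(managed_files(after))   # ['f1']
```

After the swap, f2 sits on node B, whose scan cycle of 10 exceeds CI(f2) = 6, so only f1 remains manageable.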

Figure 6.4 shows the pseudocode of the metadata distribution algorithm. In the figure, CI indicates the minimum checking interval of the data file, and S indicates the set of all the PRCR nodes. The algorithm first calculates the differences between CI and the scan cycles of all available PRCR nodes (lines 2-3). Then, from all the PRCR nodes with a scan cycle smaller than CI, the ones with the smallest difference values are selected as the candidates for the destination node (lines 4-6). Finally, one of the candidates is randomly chosen as the destination node (line 7). The reason for randomly choosing one node from the candidate set is to deal with the situation where multiple PRCR nodes have the same scan cycle. The metadata distribution algorithm is able to effectively optimize the utilization of all the PRCR nodes. However, in
