Fig. 6. Distribution of reconstruction time in a 64-node, 4-hour experiment compared to simulation (x-axis: reconstruction time in seconds; experimentation on g5k: mean = 148 seconds, std. dev. = 76; simulation: mean = 145 seconds, std. dev. = 81).
Storage System Description. In a few words, the system is made of a storage layer (upper layer) built on top of a DHT layer (lower layer) running Pastry [13]. The lower layer is in charge of managing the logical topology: finding devices, routing, and alerting of device arrivals or departures. The upper layer is in charge of storing and monitoring the data.
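To make the layering concrete, the sketch below shows one possible way to express this split in Python; the class names, the listener mechanism, and the method signatures are illustrative assumptions, not the system's actual interfaces.

    # Minimal sketch of the two-layer split; all names are illustrative
    # assumptions, not the system's actual interfaces.

    class DHTLayer:
        """Lower layer: manages the logical topology (Pastry-like leafset)."""

        def __init__(self):
            self.leafset = set()    # identifiers of neighboring devices
            self.listeners = []     # upper-layer callbacks

        def register(self, listener):
            self.listeners.append(listener)

        def notify_departure(self, device_id):
            # Called when a periodic liveness check detects a failed device.
            self.leafset.discard(device_id)
            for listener in self.listeners:
                listener.on_device_left(device_id)


    class StorageLayer:
        """Upper layer: stores data blocks and reacts to topology events."""

        def __init__(self, dht):
            self.dht = dht
            self.dht.register(self)
            self.blocks = {}        # block_id -> list of fragment holders

        def on_device_left(self, device_id):
            # Failure handling is detailed in the monitoring sketch below.
            pass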
Storing the Data. The system uses Reed-Solomon erasure codes [15] to introduce redundancy. Each data block has a device responsible for monitoring it. This device keeps a list of the devices storing a fragment of the block. The fragments of a block are stored locally on the Pastry leafset of the device in charge [16].
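The placement policy can be illustrated by the following sketch, which assumes a block is encoded into s data fragments plus r redundancy fragments and abstracts the Reed-Solomon encoding itself away; place_block and its arguments are hypothetical names.

    # Illustrative placement sketch; the erasure encoding itself is abstracted
    # away and all names are hypothetical.

    def place_block(block_id, leafset, s, r, fragment_holders):
        """Assign the s + r fragments of a block to members of the leafset
        of the device in charge, and record who stores what."""
        holders = sorted(leafset)[: s + r]
        if len(holders) < s + r:
            raise ValueError("leafset too small to hold all fragments")
        # The device in charge keeps the list of fragment holders so that it
        # can monitor the block afterwards.
        fragment_holders[block_id] = holders
        return holders

    if __name__ == "__main__":
        table = {}
        leafset = {f"device-{i}" for i in range(16)}
        print(place_block("block-42", leafset, s=8, r=3, fragment_holders=table))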
Monitoring the System. The storage system uses the information given by the lower layer to discover device failures. In Pastry, a device periodically checks whether the members of its leafset are still up and running. When the upper layer receives a message that a device has left, the device in charge updates the status of its blocks.
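The reaction to such a departure message could look roughly like the sketch below, under the standard Reed-Solomon assumption that any s of the s + r fragments are enough to rebuild a block; the function and variable names are illustrative.

    # Sketch of the reaction to a departure message; assumes any s of the
    # s + r fragments suffice to rebuild a block. Names are illustrative.

    def on_device_left(device_id, fragment_holders, s, reconstruction_queue):
        """Update the status of every block that had a fragment on the failed
        device, queueing a reconstruction when the block is still repairable."""
        for block_id, holders in fragment_holders.items():
            if device_id in holders:
                holders.remove(device_id)
                if len(holders) >= s:
                    reconstruction_queue.append(block_id)  # repair is possible
                else:
                    print(f"{block_id} is dead: {len(holders)} fragments left")

    if __name__ == "__main__":
        table = {"block-42": ["device-0", "device-1", "device-2"]}
        queue = []
        on_device_left("device-1", table, s=2, reconstruction_queue=queue)
        print(table, queue)  # block-42 keeps 2 fragments and is queued for repair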
Monitored Metrics. The application monitors and keeps statistics on the amount of data stored on its disks, the number of performed reconstructions along with their durations, and the number of dead blocks that cannot be reconstructed. The upload and download bandwidths of devices can be adjusted.
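For illustration only, these counters could be grouped as in the following sketch; the field names mirror the list above and are assumptions, not the application's actual data structures.

    # One possible grouping of the statistics listed above; field names are
    # assumptions that mirror the list.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class StorageMetrics:
        bytes_stored: int = 0                   # amount of data on local disks
        reconstructions: int = 0                # number of performed repairs
        reconstruction_times: List[float] = field(default_factory=list)  # seconds
        dead_blocks: int = 0                    # blocks that cannot be rebuilt
        upload_bw: float = 0.0                  # adjustable, in bytes per second
        download_bw: float = 0.0                # adjustable, in bytes per second

        def record_reconstruction(self, duration_s: float) -> None:
            self.reconstructions += 1
            self.reconstruction_times.append(duration_s)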
Results. There exist many different storage systems with different parameters and different reconstruction processes. The goal of this paper is not to precisely tune a model to a specific one, but to provide a general analytical framework able to predict the behavior of any storage system. Hence, we are more interested here in the global behavior of the metrics than in their absolute values.
Studied Scenario. Simulations easily allow us to evaluate several years of a system's lifetime, but this is not the case for experimentation: the time available for a single experiment is constrained to a few hours. Hence, we define an acceleration factor as the ratio between the experiment duration and the duration of the real system we want to imitate. Our goal is to check the bandwidth congestion in a real environment. Thus, we decided to shrink the disk size (e.g., from 10 GB to 100 MB, a reduction by a factor of 100), inducing a much smaller time to repair a failed disk. Then, the device failure rate is increased (from months to a few hours) to keep the ratio between disk failures and repair time proportional. The bandwidth limit, however, is kept close to that of a "real" system, the idea being to avoid inducing strange behaviors due to very small packets being transmitted in the network.
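This scaling can be made concrete with the small sketch below, which assumes an example mean time to failure of three months; the helper function is purely illustrative.

    # Worked sketch of the scaling: shrink the disk size by a factor and
    # increase the failure rate by the same factor so that the ratio between
    # repair time and failure inter-arrival stays proportional; the bandwidth
    # limit is left untouched. The 3-month MTTF is an assumed example value.

    def accelerate(disk_size_bytes, mean_time_to_failure_s, factor):
        """Return the scaled-down disk size and mean time to failure."""
        return disk_size_bytes / factor, mean_time_to_failure_s / factor

    if __name__ == "__main__":
        GB, MB = 10**9, 10**6
        MONTH = 30 * 24 * 3600
        disk, mttf = accelerate(10 * GB, 3 * MONTH, factor=100)
        print(f"disk: {disk / MB:.0f} MB, MTTF: {mttf / 3600:.1f} hours")
        # disk: 100 MB, MTTF: 21.6 hours; bandwidth limits stay at their
        # "real" values to avoid artifacts from very small packets.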