Towards Privacy for MapReduce on Hybrid Clouds Using Information Dispersal Algorithm - Data Management in Cloud, Grid and P2P Systems

3.5 Requirements

- Ratio between n and m: As m is the number of chunks necessary and sufficient to reconstruct the input file, it is advantageous to maximize it. Moreover, keeping the ratio n/m close to 1 ensures that the size of the manipulated data stays close to the original file size: extracting n chunks of length |F|/m each from a file of size |F| gives a redundancy of ((n/m) ∗ |F| − |F|) ∗ 100/|F| percent, that is, (n/m − 1) ∗ 100 percent. For instance, with n = 2m, we double the source file (100% redundancy) to generate the n files to scatter.
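The redundancy formula above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `redundancy_percent` is an assumption introduced here and is not part of the original implementation.

```python
def redundancy_percent(n: int, m: int) -> float:
    """Extra storage, as a percentage of |F|, when n chunks of size |F|/m are produced."""
    if not 1 <= m <= n:
        raise ValueError("expected 1 <= m <= n")
    # ((n/m)*|F| - |F|) * 100 / |F| simplifies to (n/m - 1) * 100: |F| cancels out.
    return (n / m - 1) * 100


print(redundancy_percent(10, 5))  # the n = 2m case from the text
print(redundancy_percent(10, 8))  # m closer to n
```

Since |F| cancels out, the overhead depends only on the ratio n/m, which is why the text reasons about the ratio rather than about absolute file sizes.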

In our approach, 2n ∗ (m − 1) messages are exchanged. Therefore, the choice of parameters n and m has an impact on the performance of our approach.

• if m ≪ n: we reduce communications, but we weaken the security and considerably increase redundancy.

• if m ≈ n: we provide better security and maintain an acceptable level of redundancy, but we increase communications between mappers.
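To make the trade-off concrete, the following sketch tabulates the message count 2n(m − 1) next to the storage overhead (n/m − 1) ∗ 100 for a few thresholds m. The helper names and the sample values n = 10 and m ∈ {2, 5, 9} are assumptions chosen for illustration, not figures from the experiments.

```python
def messages_exchanged(n: int, m: int) -> int:
    """Message count stated in the text: 2 * n * (m - 1)."""
    return 2 * n * (m - 1)


def redundancy_percent(n: int, m: int) -> float:
    """Storage overhead of n chunks of size |F|/m, as a percentage of |F|."""
    return (n / m - 1) * 100


def compare(n: int, candidates: list[int]) -> list[tuple[int, int, float]]:
    """Return (m, messages, redundancy %) for each candidate threshold m."""
    return [(m, messages_exchanged(n, m), redundancy_percent(n, m)) for m in candidates]


for m, msgs, red in compare(10, [2, 5, 9]):
    print(f"m={m}: {msgs} messages, {red:.1f}% redundancy")
```

With n = 10, moving m from 2 to 9 raises the message count from 20 to 160, while the redundancy falls from 400% to about 11%.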

- Mappers allocation: Threats can occur at the mapper itself, when it is malicious, or during communications, when intruders eavesdrop on the network. A malicious mapper can access the data it is charged with processing; this scenario is tolerated in our system. Nevertheless, when a community of at least m malicious mappers cooperate to reveal their input, they may reveal all the data, not only theirs. To prevent that, we propose to use a hybrid cloud infrastructure to deploy our solution. On each public cloud, we deploy m − 1 chunks. This number of chunks is not sufficient to reconstruct all the data. The remaining chunks will be deployed on a private cloud.
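The placement rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function `allocate_chunks`, the cloud names, and the sequential assignment order are all hypothetical, and the sketch only enforces the per-cloud bound of m − 1 chunks; it does not model collusion between different public clouds.

```python
def allocate_chunks(n: int, m: int, public_clouds: list[str],
                    private_cloud: str = "private") -> dict[str, list[int]]:
    """Assign chunk indices 0..n-1 so that no public cloud holds more than m - 1 chunks."""
    if not 1 <= m <= n:
        raise ValueError("expected 1 <= m <= n")
    cap = m - 1  # one chunk short of the reconstruction threshold
    placement: dict[str, list[int]] = {}
    next_chunk = 0
    for name in public_clouds:
        take = min(cap, n - next_chunk)
        placement[name] = list(range(next_chunk, next_chunk + take))
        next_chunk += take
    # Whatever is left goes to the trusted private cloud.
    placement[private_cloud] = list(range(next_chunk, n))
    return placement


placement = allocate_chunks(n=10, m=4, public_clouds=["public_a", "public_b"])
for cloud, chunks in placement.items():
    print(cloud, chunks)
```

In this example each public cloud receives three chunks, one fewer than the m = 4 needed for reconstruction, and the private cloud keeps the remaining four.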

There are different scenarios when taking into account the existing Cloud providers, their cost and, most importantly, the confidence placed in each and the probable threats it faces. A first scenario may divide data between two well-known Clouds, such as Amazon and IBM: their security techniques are more reliable, but the cost would be relatively higher. A second scenario would choose other, less trusted Clouds, so that, first, the cost is lower and, second, the user may allow a given level of data visibility, i.e. the number of potentially untrusted mappers. The user could choose between different scenarios according to the requirements of the application and the data.

4 Experiments and Evaluation

We have implemented our approach in Perl to manage communication between mappers and reducers, and we have used the Crypt-IDA¹ library, which is an implementation of IDA in Perl.
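For readers unfamiliar with information dispersal, the idea of splitting a file into n chunks of which any m suffice for reconstruction can be sketched in a few lines. The code below is a toy written in Python, not the Perl implementation described here, and it does not mirror the Crypt-IDA interface. It assumes arithmetic modulo the prime 257 and a Vandermonde dispersal matrix; the function names and the zero-padding scheme are choices made for this illustration only.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _vandermonde_row(x: int, m: int) -> list[int]:
    return [pow(x, j, P) for j in range(m)]


def split(data: bytes, n: int, m: int) -> dict[int, list[int]]:
    """Return n shares keyed by evaluation point x = 1..n; any m of them suffice."""
    if not 1 <= m <= n < P:
        raise ValueError("expected 1 <= m <= n < 257")
    padded = list(data) + [0] * (-len(data) % m)  # zero-pad to a multiple of m
    blocks = [padded[i:i + m] for i in range(0, len(padded), m)]
    shares = {}
    for x in range(1, n + 1):
        row = _vandermonde_row(x, m)
        # One field element per block, so each share has about |F|/m elements.
        shares[x] = [sum(r * b for r, b in zip(row, block)) % P for block in blocks]
    return shares


def _invert(matrix: list[list[int]]) -> list[list[int]]:
    """Gauss-Jordan inversion modulo P."""
    size = len(matrix)
    aug = [row[:] + [int(i == j) for j in range(size)] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] % P)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)  # modular inverse (Python 3.8+)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(a - factor * b) % P for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def reconstruct(shares: dict[int, list[int]], m: int, length: int) -> bytes:
    """Rebuild the original bytes from any m shares."""
    xs = sorted(shares)[:m]
    if len(xs) < m:
        raise ValueError("need at least m shares")
    inverse = _invert([_vandermonde_row(x, m) for x in xs])
    out = []
    for k in range(len(shares[xs[0]])):
        column = [shares[x][k] for x in xs]
        out.extend(sum(a * c for a, c in zip(row, column)) % P for row in inverse)
    return bytes(out[:length])


data = b"MapReduce on hybrid clouds"
shares = split(data, n=5, m=3)
subset = {x: shares[x] for x in (2, 4, 5)}  # any 3 of the 5 shares
assert reconstruct(subset, 3, len(data)) == data
```

Share elements in this toy range over 0..256 and therefore do not fit in a single byte; a production library would typically work in a field whose elements map directly onto bytes or words.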

We carried out a set of experiments on the Grid'5000 platform using 220 machines on the Nancy site.

In order to evaluate the performance of our system, we chose to evaluate the phases according to where they are executed: the first two phases, Split and Scatter (step 2S), being executed by the master, and the last two, Collect and

1 http://search.cpan.org/~dmalone/Crypt-IDA-0.01/lib/Crypt/IDA.pm
