Towards Privacy for MapReduce on Hybrid Clouds Using Information Dispersal Algorithm - Data Management in Cloud, Grid and P2P Systems

Computing aims to offer affordable and scalable computing capacities, which meet the needs of MapReduce applications. However, because Cloud providers lack security mechanisms that ensure data privacy, users are still reluctant to offload the processing of their sensitive data sets.

Desktop Grids (DG) [CF12] are a form of volunteer computing that has been successful thanks to the high computing and storage power they offer at a low economic cost. The architecture of this infrastructure is based on the federation of free resources: users voluntarily contribute their machines when these are idle. Volatility and security are among the constraints that discourage users from exploiting this enormous potential.

Our contribution is to enhance MapReduce security so that it protects data sent by users to remote computing infrastructures from leakage and eavesdropping. More specifically, users face two kinds of threats: 1) during data distribution, an eavesdropper could intercept data while it is being transferred, and 2) when data is stored or processed, a malicious worker could access it. Unfortunately, although encryption can protect data transfer and storage, it cannot prevent spying on data once it is deciphered for computation. Techniques exist that allow encrypted data to be processed; however, they are not yet generic enough to support any kind of computation.

As MapReduce is based on parallel processing, data has to be divided among the computing nodes so that each one processes a chunk as an input file. To improve data privacy, our approach is to use a combination of trusted and untrusted infrastructures, for instance private and public Clouds, to store the data set and execute the MapReduce applications. Our approach relies on the Information Dispersal Algorithm (IDA) to split and distribute the data.
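
To make the division of data among computing nodes concrete, here is a minimal single-process sketch of a word count in the MapReduce style. It is only an illustration: the function names (split_input, map_chunk, reduce_counts), the round-robin split and the sample input are assumptions made for this example and are not taken from any MapReduce implementation.

```python
from collections import Counter
from functools import reduce


def split_input(lines, workers):
    """Divide the input into one chunk per worker (round-robin, an assumption)."""
    return [lines[i::workers] for i in range(workers)]


def map_chunk(chunk):
    """Map step: a worker counts the words of its own chunk only."""
    return Counter(word for line in chunk for word in line.split())


def reduce_counts(partials):
    """Reduce step: merge the per-worker partial counts."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical input: three lines processed by two workers.
lines = ["a b a", "b c", "a"]
chunks = split_input(lines, workers=2)
totals = reduce_counts(map_chunk(c) for c in chunks)
```

Each worker sees only its own chunk in clear text, which is why the way chunks are produced and placed matters for privacy.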

Our idea is to break data into meaningless chunks so that a malicious worker or eavesdropper cannot get access to meaningful data. Meaningless data is obsolete and useless information, so even if a malicious worker has access to it, the meaningful data remains protected. To do so, we use IDA, which generates several chunks from an input file and disperses them over several machines. Each machine that needs to access the data has to contact other machines to get the missing chunks and reconstruct the needed information. In our case, we call a chunk produced by IDA meaningless data. So, if a malicious node holds one chunk, it has to contact and collaborate with other nodes to get the missing ones. The lack of a single chunk prevents the malicious user from accessing the meaningful data. In order to hide some chunks from malicious users, we use a hybrid cloud infrastructure. If m chunks are necessary to reconstruct the data, we deploy m − 1 chunks on untrusted infrastructure, such as a public cloud or a desktop grid. The remaining chunks are deployed on a private cloud. We assume that a private cloud is highly secure and cannot be accessed by malicious users.
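
The threshold idea described above can be sketched as follows. This is a minimal illustration and not the implementation used in the paper: it assumes a Rabin-style dispersal over the prime field of size 257, and the names disperse, reconstruct and place, the parameter n (total number of chunks) and the sample data are all hypothetical. It only demonstrates that any m chunks rebuild the data and that the m − 1 chunks placed on untrusted nodes do not suffice for reconstruction; it makes no claim about how much partial information fewer than m chunks may reveal.

```python
from typing import List, Tuple

P = 257  # smallest prime above 255, so every byte is a field element
Chunk = Tuple[int, List[int]]  # (chunk identifier x, one value per block)


def disperse(data: bytes, m: int, n: int) -> List[Chunk]:
    """Produce n chunks such that any m of them are enough to rebuild data."""
    if not 1 <= m <= n < P:
        raise ValueError("need 1 <= m <= n < 257")
    padded = data + bytes(-len(data) % m)  # zero-pad to a multiple of m
    chunks = []
    for x in range(1, n + 1):
        values = []
        for start in range(0, len(padded), m):
            block = padded[start:start + m]
            # Evaluate at x the polynomial whose coefficients are the block bytes.
            values.append(sum(b * pow(x, j, P) for j, b in enumerate(block)) % P)
        chunks.append((x, values))
    return chunks


def _solve(xs: List[int], ys: List[int]) -> List[int]:
    """Solve the Vandermonde system mod P by Gauss-Jordan elimination."""
    m = len(xs)
    rows = [[pow(x, j, P) for j in range(m)] + [y] for x, y in zip(xs, ys)]
    for col in range(m):
        pivot = next(r for r in range(col, m) if rows[r][col])
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, P)  # modular inverse (Python 3.8+)
        rows[col] = [v * inv % P for v in rows[col]]
        for r in range(m):
            if r != col and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % P for a, b in zip(rows[r], rows[col])]
    return [row[m] for row in rows]


def reconstruct(chunks: List[Chunk], m: int, length: int) -> bytes:
    """Rebuild the original bytes from at least m chunks."""
    if len(chunks) < m:
        raise ValueError("not enough chunks to reconstruct the data")
    subset = chunks[:m]
    xs = [x for x, _ in subset]
    out = bytearray()
    for k in range(len(subset[0][1])):
        out.extend(_solve(xs, [values[k] for _, values in subset]))
    return bytes(out[:length])


def place(chunks: List[Chunk], m: int) -> Tuple[List[Chunk], List[Chunk]]:
    """Hypothetical policy: m - 1 chunks go to untrusted nodes, the rest stay private."""
    return chunks[:m - 1], chunks[m - 1:]


# Hypothetical example: 5 chunks, any 3 of which rebuild the data.
secret = b"patient-id=1234"
chunks = disperse(secret, m=3, n=5)
untrusted, private = place(chunks, m=3)
restored = reconstruct(untrusted + private[:1], m=3, length=len(secret))
```

With this placement, a party that gathers every chunk stored on the untrusted side still holds only m − 1 chunks, so reconstruction requires at least one chunk from the private cloud.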

The rest of the paper is organized as follows. Section 2 presents the IDA dispersal algorithm and MapReduce. In Section 3, we describe our approach and its various components. Section 4 analyzes the experimental results. Section 5 discusses related work. Finally, we conclude in Section 6.
