Data Partitioning for Minimizing Transferred Data in MapReduce - Data Management in Cloud, Grid and P2P Systems


Data Partitioning for Minimizing Transferred Data in MapReduce

Miguel Liroz-Gistau 1, Reza Akbarinia 1, Divyakant Agrawal 2, Esther Pacitti 3, and Patrick Valduriez 1

1 INRIA & LIRMM, Montpellier, France

{ Miguel.Liroz Gistau,Reza.Akbarinia,Patrick.Valduriez } @inria.fr

2 University of California, Santa Barbara

agrawal@cs.ucsb.edu

3 University Montpellier 2, INRIA & LIRMM, Montpellier, France

Esther.Pacitti@lirmm.fr

Abstract. Reducing data transfer in MapReduce's shuffle phase is very
important because it increases data locality of reduce tasks, and thus
decreases the overhead of job executions. In the literature, several
optimizations have been proposed to reduce data transfer between mappers
and reducers. Nevertheless, all these approaches are limited by how
intermediate key-value pairs are distributed over map outputs. In this
paper, we address the problem of high data transfers in MapReduce, and
propose a technique that repartitions tuples of the input datasets, and
thereby optimizes the distribution of key-value pairs over mappers and
increases the data locality in reduce tasks. Our approach captures the
relationships between input tuples and intermediate keys by monitoring
the execution of a set of MapReduce jobs which are representative of the
workload. Then, based on those relationships, it assigns input tuples to
the appropriate chunks. We evaluated our approach through experimentation
in a Hadoop deployment on top of Grid5000 using standard benchmarks. The
results show a high reduction in data transfer during the shuffle phase
compared to native Hadoop.

1 Introduction

MapReduce [4] has established itself as one of the most popular
alternatives for big data processing due to the simplicity of its
programming model and its automatic management of parallel execution in
clusters of machines. Initially proposed by Google for indexing the web,
it has been applied to a wide range of problems that require processing
large quantities of data, helped by the popularity of Hadoop [2], an
open-source implementation. MapReduce divides the computation into two
main phases, namely map and reduce, each of which is carried out by
several tasks that process the data in parallel. Between them there is a
phase, called shuffle, where the data produced by the map phase is
ordered, partitioned and transferred to the appropriate machines
executing the reduce phase.
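
The three phases described above, and the reason the placement of input tuples affects shuffle traffic, can be illustrated with a small single-process sketch. It is a toy model written for this illustration, not Hadoop's API and not the technique proposed in the paper. All function names are hypothetical. Two simplifying assumptions are made: mapper i and reducer i run on the same machine, so a key-value pair counts as transferred only when its reducer index differs from its mapper index, and a byte-sum partitioner stands in for a real hash partitioner so that the result is deterministic.

```python
from collections import defaultdict


def map_words(chunk):
    """Map phase (word count): emit (word, 1) for every word in the chunk."""
    return [(word, 1) for line in chunk for word in line.split()]


def partition(key, num_reducers):
    """Toy deterministic partitioner (assumption: stands in for a hash partitioner)."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Shuffle phase: order, partition and group pairs; count the pairs that move.

    Assumption: mapper i is co-located with reducer i, so a pair is
    'transferred' only when it is sent to a different index.
    """
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    transferred = 0
    for mapper_id, pairs in enumerate(map_outputs):
        for key, value in sorted(pairs):
            reducer_id = partition(key, num_reducers)
            reducer_inputs[reducer_id][key].append(value)
            if reducer_id != mapper_id:
                transferred += 1
    return reducer_inputs, transferred


def run_job(chunks):
    """Run map, shuffle and reduce with one mapper and one reducer per chunk."""
    map_outputs = [map_words(chunk) for chunk in chunks]
    reducer_inputs, transferred = shuffle(map_outputs, len(chunks))
    counts = {}
    for groups in reducer_inputs:
        for key, values in groups.items():
            counts[key] = sum(values)  # reduce phase: sum the counts per key
    return counts, transferred


# The same four input tuples, assigned to chunks in two different ways.
arrival_order = [["b b", "a a"], ["b b", "a a"]]
repartitioned = [["b b", "b b"], ["a a", "a a"]]

print(run_job(arrival_order))   # same word counts, 4 pairs transferred
print(run_job(repartitioned))   # same word counts, 0 pairs transferred
```

Both runs compute identical word counts. Only the number of pairs crossing machines changes, which is the quantity that repartitioning the input is meant to reduce.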
