Data Partitioning for Minimizing Transferred
Data in MapReduce
Miguel Liroz-Gistau 1 , Reza Akbarinia 1 , Divyakant Agrawal 2 ,
Esther Pacitti 3 , and Patrick Valduriez 1
1 INRIA & LIRMM, Montpellier, France
{ Miguel.Liroz Gistau,Reza.Akbarinia,Patrick.Valduriez } @inria.fr
2 University of California, Santa Barbara
agrawal@cs.ucsb.edu
3 University Montpellier 2, INRIA & LIRMM, Montpellier, France
Esther.Pacitti@lirmm.fr
Abstract. Reducing data transfer in MapReduce's shuffle phase is very
important because it increases data locality of reduce tasks, and thus
decreases the overhead of job executions. In the literature, several
optimizations have been proposed to reduce data transfer between
mappers and reducers. Nevertheless, all these approaches are limited by how
intermediate key-value pairs are distributed over map outputs. In this
paper, we address the problem of high data transfers in MapReduce,
and propose a technique that repartitions tuples of the input datasets,
and thereby optimizes the distribution of key-values over mappers, and
increases the data locality in reduce tasks. Our approach captures the
relationships between input tuples and intermediate keys by monitoring
the execution of a set of MapReduce jobs which are representative of
the workload. Then, based on those relationships, it assigns input tuples
to the appropriate chunks. We evaluated our approach through
experimentation in a Hadoop deployment on top of Grid5000 using standard
benchmarks. The results show a high reduction in data transfer during the
shuffle phase compared to native Hadoop.
1 Introduction
MapReduce [4] has established itself as one of the most popular alternatives
for big data processing due to the simplicity of its programming model and its
automatic management of parallel execution in clusters of machines. Initially
proposed by Google for indexing the web, it has been applied to a wide range
of problems that require processing large quantities of data, favored by the
popularity of Hadoop [2], an open-source implementation. MapReduce divides the
computation into two main phases, namely map and reduce, which in turn are
carried out by several tasks that process the data in parallel. Between them,
there is a phase, called shuffle, where the data produced by the map phase is
ordered, partitioned and transferred to the appropriate machines executing the
reduce phase.
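
The three phases described above can be illustrated with a minimal single-process sketch. It is only an illustration under stated assumptions, not Hadoop's implementation: the function names (partition, map_phase, shuffle_phase, reduce_phase, run_job), the CRC32-based partitioner, and the word-count job are all hypothetical choices made for this example, and the "transfer" to reducers is simulated by placing pairs in in-memory lists.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Illustrative deterministic partitioner (an assumption, not Hadoop's own).
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def map_phase(chunks, map_fn):
    # One simulated map task per input chunk; each emits (key, value) pairs.
    return [[pair for record in chunk for pair in map_fn(record)]
            for chunk in chunks]


def shuffle_phase(map_outputs, num_reducers):
    # Partition intermediate pairs by key, group values per key, and
    # order the keys inside each reducer's partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[partition(key, num_reducers)][key].append(value)
    return [sorted(p.items()) for p in partitions]


def reduce_phase(partitions, reduce_fn):
    # One simulated reduce task per partition.
    return [[reduce_fn(key, values) for key, values in part]
            for part in partitions]


def run_job(chunks, map_fn, reduce_fn, num_reducers=2):
    map_outputs = map_phase(chunks, map_fn)
    partitions = shuffle_phase(map_outputs, num_reducers)
    return reduce_phase(partitions, reduce_fn)


# Hypothetical word-count job used only to exercise the sketch.
def word_count_map(line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    return (word, sum(counts))
```

Calling run_job([["a b a", "c"], ["b a"]], word_count_map, word_count_reduce) returns one list of (word, count) pairs per simulated reducer. In a real cluster, every pair whose reducer runs on a different machine than the map task that produced it must cross the network during the shuffle, which is the cost this paper targets.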
 