Performance Analysis of Adapting a MapReduce Framework to Dynamically Accommodate Heterogeneity - Transactions on Large-Scale-Data-and Knowledge-Centered Systems XX


Performance Analysis of Adapting a MapReduce Framework to Dynamically Accommodate Heterogeneity

Jessica Hartog, Renan DelValle, Madhusudhan Govindaraju, and Michael J. Lewis

Department of Computer Science, State University of New York (SUNY) at Binghamton, Binghamton, NY 13902, USA

{jhartog1,rdelval1,mgovinda,mlewis}@binghamton.edu

http://www.cs.binghamton.edu

Abstract. When data centers employ the common and economical practice of upgrading subsets of nodes incrementally, rather than replacing or upgrading all nodes at once, they end up with clusters whose nodes have non-uniform processing capability, which we also call performance-heterogeneity. Popular frameworks supporting the effective MapReduce programming model for Big Data applications do not flexibly adapt to these environments. Instead, existing MapReduce frameworks, including Hadoop, typically divide data evenly among worker nodes, thereby inducing the well-known problem of stragglers on slower nodes. Our alternative MapReduce framework, called MARLA, divides each worker's labor into sub-tasks, delays the binding of data to worker processes, and thereby enables applications to run faster in performance-heterogeneous environments. This approach does introduce overhead, however. We explore and characterize the opportunity for performance gains, and identify when the benefits outweigh the costs. Our results suggest that frameworks should support finer-grained sub-tasking and dynamic data partitioning when running on some performance-heterogeneous clusters. Blindly taking this approach in homogeneous clusters can slow applications down. Our study further suggests the opportunity for cluster managers to build performance-heterogeneous clusters by design, if they also run MapReduce frameworks that can exploit them.
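The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not MARLA's scheduler or Hadoop's. It is a minimal simulation under stated assumptions: each node has a fixed relative speed, work is perfectly divisible, and every sub-task pays a fixed overhead. The function names `static_makespan` and `dynamic_makespan`, and the parameters `subtasks_per_node` and `overhead`, are hypothetical and chosen only for illustration.

```python
import heapq


def static_makespan(total_work, speeds):
    """Even split: each node gets the same share, so the slowest node
    determines when the whole job finishes (the straggler effect)."""
    share = total_work / len(speeds)
    return max(share / s for s in speeds)


def dynamic_makespan(total_work, speeds, subtasks_per_node, overhead=0.0):
    """Late binding: work is cut into equal sub-tasks and whichever node
    becomes idle first takes the next one. `overhead` is an assumed fixed
    cost paid per sub-task (hypothetical parameter)."""
    n_subtasks = subtasks_per_node * len(speeds)
    chunk = total_work / n_subtasks
    # Min-heap of (time at which the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_subtasks):
        t, i = heapq.heappop(free_at)
        done = t + overhead + chunk / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


if __name__ == "__main__":
    hetero = [1, 1, 2, 2]   # two slow nodes, two nodes twice as fast
    homo = [1, 1, 1, 1]
    print(static_makespan(120, hetero), dynamic_makespan(120, hetero, 4))
    print(static_makespan(120, homo), dynamic_makespan(120, homo, 4, overhead=0.5))
```

In this toy setting, the heterogeneous cluster finishes in 22.5 time units with sub-tasking versus 30 with an even split, while the homogeneous cluster with a per-sub-task overhead of 0.5 takes 32 instead of 30. The numbers come from the model's assumptions, not from the paper's measurements, but they show the shape of the argument: late binding helps when speeds differ and can hurt when they do not.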

1 Introduction

Scientists continue to develop applications that generate, process, and analyze large amounts of data. The MapReduce programming model helps express operations on Big Data. The model and its associated framework implementations, including Hadoop [1], successfully support applications such as genome sequencing in bioinformatics [2, 3], and catalog indexing of celestial objects in

This work was supported in part by NSF grant CNS-0958501.
