Incremental MapReduce Computations - Large Scale and Big Data: Processing and Management

4 Incremental MapReduce Computations

Pramod Bhatotia, Alexander Wieder, Umut A. Acar, and Rodrigo Rodrigues

CONTENTS

4.1 Introduction
4.2 System Overview
4.2.1 Self-Adjusting Computation
4.2.2 Basic Design
4.2.3 Challenge: Transparency
4.2.4 Challenge: Efficiency
4.3 Incremental HDFS
4.4 Incremental MapReduce
4.5 Memoization-Aware Scheduler
4.6 Implementation and Evaluation
4.6.1 Implementation
4.6.2 Applications
4.6.3 Overview of the Experiments
4.6.4 Incremental HDFS
4.6.5 Work and Time Speedup
4.6.6 Individual Design Features
4.6.7 Overheads
4.7 Related Work
4.8 Conclusion
Acknowledgments
References

4.1 INTRODUCTION

Distributed processing of large data sets has become an important task for many companies and organizations, which rely on data analysis to improve the way they operate. This area has attracted a lot of attention from both researchers and practitioners over the last few years, particularly after the introduction of the MapReduce paradigm for large-scale parallel data processing [19].

A common characteristic of the data sets provided as inputs to large-scale data-processing jobs is that they do not vary dramatically over time. Instead, the same job is often invoked consecutively with small changes in this input from one
