4 Incremental MapReduce Computations
Pramod Bhatotia, Alexander Wieder,
Umut A. Acar, and Rodrigo Rodrigues
CONTENTS
4.1 Introduction
4.2 System Overview
4.2.1 Self-Adjusting Computation
4.2.2 Basic Design
4.2.3 Challenge: Transparency
4.2.4 Challenge: Efficiency
4.3 Incremental HDFS
4.4 Incremental MapReduce
4.5 Memoization-Aware Scheduler
4.6 Implementation and Evaluation
4.6.1 Implementation
4.6.2 Applications
4.6.3 Overview of the Experiments
4.6.4 Incremental HDFS
4.6.5 Work and Time Speedup
4.6.6 Individual Design Features
4.6.7 Overheads
4.7 Related Work
4.8 Conclusion
Acknowledgments
References
4.1 INTRODUCTION
Distributed processing of large data sets has become an important task for many
companies and organizations, for which data analysis is a key means of improving
how they operate. This area has attracted considerable attention from both
researchers and practitioners in recent years, particularly since the introduction
of the MapReduce paradigm for large-scale parallel data processing [19].
A common characteristic of the data sets provided as input to large-scale
data-processing jobs is that they do not vary dramatically over time. Instead, the
same job is often invoked consecutively with small changes in this input from one