This chapter is organized as follows. In Section 17.2, we introduce the
MapReduce programming model and the typical setting for running Hadoop on
a public cloud. In Section 17.3, we analyze the execution of a MapReduce program
and propose the cost model. In Section 17.4, we describe the statistical method used to
learn the model for a specific MapReduce program. In Section 17.5, we formulate
several resource-provisioning problems as optimization problems based on the
cost model. In Section 17.6, we present the experimental results that validate the
cost model and analyze the modeling errors. In Section 17.7, the related work on
MapReduce performance analysis is briefly discussed.
17.2 PRELIMINARY
MapReduce programming for large-scale parallel data processing was developed
by Google [5] and has become popular for big-data processing. MapReduce is
more than a programming model—it also includes the system support for processing
MapReduce jobs in parallel in a large-scale cluster. Apache Hadoop is the most
popular open-source implementation of the MapReduce framework. Thus, our
discussions, in particular the experiments, will be based on Apache Hadoop, although
the analysis and modeling approach should also fit other MapReduce implementations.
MapReduce programming is best understood with an example—the well-known
WordCount program. WordCount counts the frequency of each word in a large
document collection. Its map program partitions the input lines into words and emits
tuples 〈w, 1〉 for aggregation, where "w" represents a word and "1" marks one occurrence
of the word. In the reduce program, the tuples with the same word are grouped
together and their occurrences are summed up to get the final result.
Algorithm 17.1: The WordCount MapReduce Program
1: map(file)
2: for each line in the file do
3:   for each word w in the line do
4:     Emit(〈w, 1〉)
5:   end for
6: end for
1: reduce(w, v)
2: w: word, v: list of counts.
3: d ← 0;
4: for each vi in v do
5:   d ← d + vi;
6: end for
7: Emit(〈w, d〉);
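Outside Hadoop, the same map and reduce logic can be sketched in plain Python. The sketch below is illustrative only: the in-memory grouping dictionary stands in for Hadoop's shuffle-and-sort phase, and all function names are our own, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    """Mirror the map program: emit a <word, 1> tuple for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(word, counts):
    """Mirror the reduce program: sum the occurrence counts for one word."""
    d = 0
    for c in counts:
        d += c
    return (word, d)

def word_count(lines):
    """Group map output by key (standing in for the shuffle), then reduce."""
    groups = defaultdict(list)
    for word, one in map_phase(lines):
        groups[word].append(one)
    return dict(reduce_phase(w, v) for w, v in groups.items())
```

In a real Hadoop job, the grouping is performed by the framework between the map and reduce stages, with each reducer receiving the tuples for a disjoint subset of the keys.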
When deploying a Hadoop cluster in a public cloud, users need to request a num-
ber of virtual machines from the cloud and then start them with a system image that
has the Hadoop package preinstalled. Because users' data may reside in the cloud