This chapter is organized as follows. In Section 17.2, we introduce the
MapReduce programming model and the typical setting for running Hadoop on
a public cloud. In Section 17.3, we analyze the execution of a MapReduce program
and propose the cost model. In Section 17.4, we describe the statistical method used to
learn the model for a specific MapReduce program. In Section 17.5, we formulate
several resource-provisioning problems as optimization problems based on the
cost model. In Section 17.6, we present the experimental results that validate the
cost model and analyze the modeling errors. In Section 17.7, the related work on
MapReduce performance analysis is briefly discussed.
17.2 PRELIMINARY
MapReduce programming for large-scale parallel data processing was developed
by Google [5] and has become popular for big-data processing. MapReduce is
more than a programming model—it also includes the system support for processing
MapReduce jobs in parallel in a large-scale cluster. Apache Hadoop is the most
popular open-source implementation of the MapReduce framework. Thus, our
discussions, in particular the experiments, will be based on Apache Hadoop, although
the analysis and modeling approach should also fit other MapReduce implementations.
MapReduce programming is best understood with an example—the well-known
WordCount program. WordCount counts the frequency of each word in a large
document collection. Its map program partitions the input lines into words and emits
tuples 〈w, 1〉 for aggregation, where "w" represents a word and "1" marks one occurrence
of the word. In the reduce program, the tuples with the same word are grouped
together and their occurrences are summed up to get the final result.
Algorithm 17.1: The WordCount MapReduce Program
1: map(file)
2: for each line in the file do
3:   for each word w in the line do
4:     Emit(〈w, 1〉)
5:   end for
6: end for
1: reduce(w, v)
2: w: word, v: list of counts.
3: d ← 0;
4: for each vi in v do
5:   d ← d + vi;
6: end for
7: Emit(〈w, d〉);
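Outside Hadoop, the same map and reduce logic can be sketched in plain Python. The sketch below is illustrative only: the in-memory grouping dictionary stands in for Hadoop's shuffle-and-sort phase, and all function names are our own, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    """Mirror the map program: emit a <word, 1> tuple for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(word, counts):
    """Mirror the reduce program: sum the occurrence counts for one word."""
    d = 0
    for c in counts:
        d += c
    return (word, d)

def word_count(lines):
    """Group map output by key (standing in for the shuffle), then reduce."""
    groups = defaultdict(list)
    for word, one in map_phase(lines):
        groups[word].append(one)
    return dict(reduce_phase(w, v) for w, v in groups.items())
```

In a real Hadoop job, the grouping is performed by the framework between the map and reduce stages, with each reducer receiving the tuples for a disjoint subset of the keys.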
When deploying a Hadoop cluster in a public cloud, users need to request a num-
ber of virtual machines from the cloud and then start them with a system image that
has the Hadoop package preinstalled. Because users' data may reside in the cloud