17.6.3 Model Analysis
17.6.3.1 Regression Analysis
17.6.3.2 Prediction Accuracy
17.7 Related Work
17.8 Conclusion
Acknowledgments
References
Running MapReduce programs in the cloud introduces an important problem: how to optimize resource provisioning so that the financial cost or the job finish time is minimized for a specific job. An important step toward this goal is modeling the cost of a MapReduce program. In this chapter, we study the whole MapReduce processing pipeline and build a cost function that explicitly models the relationship among the amount of input data, the available system resources (map and reduce slots), and the complexity of the reduce program for the target MapReduce job. The model parameters can be learned from test runs. Based on this cost model, we can solve a number of decision problems, such as finding the amount of resources that minimizes the financial cost under a job finish deadline, minimizes the job finish time under a financial budget, or achieves the optimal tradeoff between time and financial cost. With appropriate modeling of the energy consumption of the resources, these optimization problems can be extended to address energy-efficient MapReduce computing. Experimental results show that the proposed modeling approach performs well on a number of tested MapReduce programs, both on an in-house cluster and on Amazon EC2.
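The actual cost function and its evaluation appear later in the chapter (Sections 17.6.3.1 and 17.6.3.2 analyze the fitted model). As a minimal sketch of the workflow the abstract outlines, the example below assumes a purely hypothetical linear cost form T(M, m, r) = b0 + b1*M/m + b2*M/r (M = input size, m = map slots, r = reduce slots), learns its parameters from test runs by least squares, and scans slot configurations for the cheapest one whose predicted time meets a deadline. The cost form, the pricing rule, and all names here are illustrative assumptions, not the chapter's model.

import numpy as np

def design_matrix(M, m, r):
    """Feature matrix [1, M/m, M/r] for the assumed (hypothetical) cost form."""
    M, m, r = (np.asarray(v, dtype=float) for v in (M, m, r))
    return np.column_stack([np.ones_like(M), M / m, M / r])

def fit_cost_model(test_runs):
    """Learn (b0, b1, b2) by least squares from test runs.

    test_runs: iterable of (input_size, map_slots, reduce_slots, time_sec).
    """
    M, m, r, t = map(np.array, zip(*test_runs))
    beta, *_ = np.linalg.lstsq(design_matrix(M, m, r), t, rcond=None)
    return beta

def predict_time(beta, M, m, r):
    """Predicted job time in seconds for one (M, m, r) configuration."""
    return (design_matrix([M], [m], [r]) @ beta).item()

def cheapest_under_deadline(beta, M, deadline_sec, price_per_slot_hour,
                            max_slots=100):
    """Scan (m, r) configurations; return (cost, m, r, time) for the cheapest
    one whose predicted time meets the deadline. The pricing rule (m + r
    slots billed for the job's duration) is an illustrative assumption."""
    best = None
    for m in range(1, max_slots + 1):
        for r in range(1, max_slots + 1):
            t = predict_time(beta, M, m, r)
            if t > deadline_sec:
                continue
            cost = (m + r) * (t / 3600.0) * price_per_slot_hour
            if best is None or cost < best[0]:
                best = (cost, m, r, t)
    return best

With parameters fitted from a handful of test runs on small input samples, the same scan answers the dual question (the fastest configuration under a financial budget) by swapping the roles of cost and time in the filter, and a weighted combination of the two yields the time/cost tradeoff curve.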
17.1 INTRODUCTION
With the deployment of web applications, scientific computing, and sensor networks, large amounts of data can be collected from users, applications, and the environment. For example, user clickthrough data has been an important data source for improving web search relevance [9] and for understanding online user behaviors [20]. Such data sets can easily reach terabyte scale, and they are produced continuously. Thus, an urgent task is to analyze these large data sets efficiently, so that the important information in the data can be captured and understood promptly. As a flexible and scalable parallel programming and processing model, MapReduce [5] (and its open-source implementation Hadoop) has recently been widely used for processing and analyzing such large-scale data sets [4,8,11,15,17,18].
On the other hand, data analysts in most companies, research institutes, and government agencies do not have the luxury of access to large private Hadoop/MapReduce clusters. Therefore, running Hadoop/MapReduce on top of a public cloud has become a realistic option for most users. In view of this requirement, Amazon has developed Elastic MapReduce,* which runs on-demand Hadoop/MapReduce clusters on top of Amazon EC2 nodes. There are also scripts† for users to manually set up Hadoop/MapReduce on EC2 nodes.
* aws.amazon.com/elasticmapreduce/.
† wiki.apache.org/hadoop/AmazonEC2.
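For concreteness, the following is a hedged sketch of requesting such an on-demand Hadoop cluster programmatically through the Elastic MapReduce API, using the boto3 Python SDK's run_job_flow call; the region, release label, instance types, instance count, and key name are placeholders, not settings from the chapter.

import boto3

# Illustrative sketch: request an on-demand Hadoop cluster from Amazon
# Elastic MapReduce. Every concrete value below is a placeholder.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-hadoop-cluster",            # placeholder cluster name
    ReleaseLabel="emr-6.10.0",                # any EMR release bundling Hadoop
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",    # placeholder instance types
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,                   # 1 master + 3 workers
        "KeepJobFlowAliveWhenNoSteps": True,  # keep cluster up between jobs
        "Ec2KeyName": "my-key-pair",          # placeholder EC2 key pair
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])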