17.6.3 Model Analysis
17.6.3.1 Regression Analysis
17.6.3.2 Prediction Accuracy
17.7 Related Work
17.8 Conclusion
Acknowledgments
References
Running MapReduce programs in the cloud introduces an important problem: how to optimize resource provisioning so that the financial cost or the job finish time is minimized for a specific job. An important step toward this goal is modeling the cost of a MapReduce program. In this chapter, we study the whole MapReduce processing pipeline and build a cost function that explicitly models the relationship among the amount of input data, the available system resources (map and reduce slots), and the complexity of the reduce program for the target MapReduce job. The model parameters can be learned from test runs. Based on this cost model, we can solve a number of decision problems, such as finding the amount of resources that minimizes the financial cost under a job finish deadline, minimizes the job finish time under a financial budget, or achieves the optimal tradeoff between time and financial cost. With appropriate modeling of the energy consumption of the resources, these optimization problems can be extended to address energy-efficient MapReduce computing. Experimental results show that the proposed modeling approach performs well on a number of tested MapReduce programs, both on an in-house cluster and on Amazon EC2.
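The actual cost function and its evaluation appear later in the chapter (Sections 17.6.3.1 and 17.6.3.2 analyze the fitted model). As a minimal sketch of the workflow the abstract outlines, the example below assumes a purely hypothetical linear cost form T(M, m, r) = b0 + b1*M/m + b2*M/r (M = input size, m = map slots, r = reduce slots), learns its parameters from test runs by least squares, and scans slot configurations for the cheapest one whose predicted time meets a deadline. The cost form, the pricing rule, and all names here are illustrative assumptions, not the chapter's model.

import numpy as np

def design_matrix(M, m, r):
    """Feature matrix [1, M/m, M/r] for the assumed (hypothetical) cost form."""
    M, m, r = (np.asarray(v, dtype=float) for v in (M, m, r))
    return np.column_stack([np.ones_like(M), M / m, M / r])

def fit_cost_model(test_runs):
    """Learn (b0, b1, b2) by least squares from test runs.

    test_runs: iterable of (input_size, map_slots, reduce_slots, time_sec).
    """
    M, m, r, t = map(np.array, zip(*test_runs))
    beta, *_ = np.linalg.lstsq(design_matrix(M, m, r), t, rcond=None)
    return beta

def predict_time(beta, M, m, r):
    """Predicted job time in seconds for one (M, m, r) configuration."""
    return (design_matrix([M], [m], [r]) @ beta).item()

def cheapest_under_deadline(beta, M, deadline_sec, price_per_slot_hour,
                            max_slots=100):
    """Scan (m, r) configurations; return (cost, m, r, time) for the cheapest
    one whose predicted time meets the deadline. The pricing rule (m + r
    slots billed for the job's duration) is an illustrative assumption."""
    best = None
    for m in range(1, max_slots + 1):
        for r in range(1, max_slots + 1):
            t = predict_time(beta, M, m, r)
            if t > deadline_sec:
                continue
            cost = (m + r) * (t / 3600.0) * price_per_slot_hour
            if best is None or cost < best[0]:
                best = (cost, m, r, t)
    return best

With parameters fitted from a handful of test runs on small input samples, the same scan answers the dual question (the fastest configuration under a financial budget) by swapping the roles of cost and time in the filter, and a weighted combination of the two yields the time/cost tradeoff curve.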
17.1 INTRODUCTION
With the deployment of web applications, scientific computing, and sensor networks, large amounts of data can be collected from users, applications, and the environment. For example, user clickthrough data has been an important data source for improving web search relevance [9] and for understanding online user behaviors [20]. Such data sets can easily reach terabyte scale, and they are produced continuously. Thus, an urgent task is to analyze these large data sets efficiently, so that the important information in the data can be captured and understood promptly. As a flexible and scalable parallel programming and processing model, MapReduce [5] (and its open-source implementation Hadoop) has recently been widely used for processing and analyzing such large-scale data sets [4,8,11,15,17,18].
On the other hand, data analysts in most companies, research institutes, and government agencies do not have the luxury of access to large private Hadoop/MapReduce clusters. Therefore, running Hadoop/MapReduce on top of a public cloud has become a realistic option for most users. In view of this requirement, Amazon has developed Elastic MapReduce,* which runs on-demand Hadoop/MapReduce clusters on top of Amazon EC2 nodes. There are also scripts† for users to manually set up Hadoop/MapReduce on EC2 nodes.
* aws.amazon.com/elasticmapreduce/.
† wiki.apache.org/hadoop/AmazonEC2.
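For concreteness, the following is a hedged sketch of requesting such an on-demand Hadoop cluster programmatically through the Elastic MapReduce API, using the boto3 Python SDK's run_job_flow call; the region, release label, instance types, instance count, and key name are placeholders, not settings from the chapter.

import boto3

# Illustrative sketch: request an on-demand Hadoop cluster from Amazon
# Elastic MapReduce. Every concrete value below is a placeholder.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-hadoop-cluster",            # placeholder cluster name
    ReleaseLabel="emr-6.10.0",                # any EMR release bundling Hadoop
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",    # placeholder instance types
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,                   # 1 master + 3 workers
        "KeepJobFlowAliveWhenNoSteps": True,  # keep cluster up between jobs
        "Ec2KeyName": "my-key-pair",          # placeholder EC2 key pair
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])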