However, running a Hadoop cluster on top of a public cloud has different requirements from running a private Hadoop cluster. First, a dedicated Hadoop cluster is normally started on a number of virtual nodes for each job, to take advantage of the “pay-as-you-use” cloud pricing model. Because users' data-processing requests normally arrive intermittently, it is not economical to maintain a permanent Hadoop cluster as private deployments do; on-demand clusters are more appropriate for most users. Consequently, there is no multiuser or multijob resource competition within such a Hadoop cluster. Second, it is now the user's responsibility to set the appropriate number of virtual nodes for the Hadoop cluster. The optimal setting may differ from application to application and depends on the amount of input data. An effective method is needed to help the user make this decision.
The problem of optimizing resource provisioning for MapReduce programs involves two intertwined factors: the cost of provisioning the virtual nodes and the time to finish the job. Intuitively, with more resources the job can finish sooner. However, resources are provisioned at a cost, which is also proportional to how long the resources are held. Thus, it is tricky to find the setting that minimizes the financial cost. With additional constraints such as a deadline or a financial budget for finishing the job, the problem becomes more complicated. More generally, the energy consumption of a MapReduce program can be modeled in a similar way, which is critical to energy-efficient cloud computing [2].
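To make this tradeoff concrete, consider a simple illustration (the notation here is ours, not part of the proposed model): let p be the price of one virtual node per unit time, n the number of provisioned nodes, and T(n) the job's execution time on n nodes. The financial cost is then roughly

    C(n) = p · n · T(n)

Because T(n) typically decreases sublinearly in n while the factor n grows linearly, C(n) is generally not monotone in n, so neither the smallest nor the largest cluster is necessarily the cheapest; adding a deadline or a budget turns the choice of n into a constrained optimization problem.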
We propose a method to help users make resource-provisioning decisions for running MapReduce programs in public clouds. The method is based on a specialized MapReduce cost model with a number of model parameters that must be determined for each specific application. These parameters can be learned from test runs on a small number of virtual nodes with small test data. Based on the cost model and the estimated parameters, the user can find the optimal resource setting by solving certain optimization problems.
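To illustrate how such parameters might be learned, the following is a minimal sketch, assuming (our assumption, not the proposed model) a handful of illustrative nonlinear basis functions of the number of map slots m, reduce slots r, and input size M, with the weights fitted by ordinary least squares on timings from small test runs:

    import numpy as np

    def features(m, r, M):
        # Illustrative nonlinear basis functions of map slots m, reduce slots r,
        # and input size M; the actual model defines its own set of terms.
        return np.array([1.0, M / m, M / r, (M * np.log(M + 1.0)) / m])

    def fit_weights(runs):
        # runs: list of (m, r, M, observed_time) tuples from small-scale test runs.
        # Returns least-squares weights of the linear-in-parameters cost model.
        X = np.array([features(m, r, M) for m, r, M, _ in runs])
        y = np.array([t for _, _, _, t in runs])
        w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        return w

    def predict_time(w, m, r, M):
        # Predicted job time for a candidate provisioning (m, r) and input size M.
        return float(features(m, r, M) @ w)

    # Hypothetical timings from four small test runs:
    # (map slots, reduce slots, input size in GB, observed time in minutes).
    runs = [(4, 2, 1.0, 120.0), (8, 4, 1.0, 70.0),
            (8, 4, 2.0, 130.0), (16, 8, 2.0, 80.0)]
    w = fit_weights(runs)
    print(predict_time(w, 32, 16, 10.0))  # extrapolate to a larger cluster and input

The linear-in-parameters form is what makes this cheap in practice: the weights can be estimated from a few small-scale runs and then used to predict costs for larger inputs and clusters.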
Our approach has several unique contributions:

•  Different from existing work on the performance analysis of MapReduce programs, our approach focuses on the relationship among the critical variables: the number of Map/Reduce slots, the amount of input data, and the complexity of the application-specific components. The resulting cost model can be represented as a weighted linear combination of a set of nonlinear functions of these variables. Linear models provide robust generalization power, which allows the weights to be determined from data collected in small-scale tests.
•  Based on this cost model, we formulate the important decision problems as several optimization problems. The resource requirement is mapped to the number of map/reduce slots; the financial cost of provisioning resources is the product of the cost function and the acquired map/reduce slots. With the explicit cost model, the resulting optimization problems are easy to formulate and solve (a minimal sketch of one such formulation follows this list).
•  We have conducted a set of experiments on both a local Hadoop cluster and Amazon EC2 to validate the cost model. The experiments show that the model fits the data collected from four tested MapReduce programs very well. The experiment on model prediction also shows low error rates.
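As a hedged sketch of one such formulation (the per-slot pricing, the slot ranges, and the reuse of predict_time and w from the earlier sketch are our own assumptions for illustration), minimizing financial cost subject to a deadline can be approached as a simple search over candidate slot counts:

    def minimize_cost(w, M, deadline, price_per_slot_time=0.1, max_slots=64):
        # Choose (m, r) minimizing cost = price * (m + r) * predicted_time,
        # subject to predicted_time <= deadline. Exhaustive search for
        # illustration; with an explicit cost model, a solver or a closed-form
        # solution could be used instead.
        best = None
        for m in range(1, max_slots + 1):
            for r in range(1, max_slots + 1):
                t = predict_time(w, m, r, M)  # same time units as the test runs
                if t > deadline:
                    continue
                cost = price_per_slot_time * (m + r) * t
                if best is None or cost < best[0]:
                    best = (cost, m, r, t)
        return best  # (cost, map slots, reduce slots, predicted time), or None

    print(minimize_cost(w, M=10.0, deadline=60.0))

Swapping the roles of the deadline and the budget gives the dual problem (minimize time subject to a spending cap) in the same way.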