However, running a Hadoop cluster on top of a public cloud has different requirements from running a private Hadoop cluster. First, a dedicated Hadoop cluster is normally started on a number of virtual nodes for each job, to take advantage of the “pay-as-you-use” cloud pricing model. Because users' data-processing requests normally arrive intermittently, it is not economical to maintain a permanent Hadoop cluster as private deployments do; on-demand clusters are more appropriate for most users. Consequently, there is no multiuser or multijob resource competition within such a Hadoop cluster. Second, it is now the user's responsibility to set the appropriate number of virtual nodes for the Hadoop cluster. The optimal setting may differ from application to application and depends on the amount of input data. An effective method is needed to help the user make this decision.
The problem of optimizing resource provisioning for MapReduce programs involves two intertwined factors: the cost of provisioning the virtual nodes and the time to finish the job. Intuitively, with more resources the job can finish sooner. However, resources are provisioned at a cost, which is also proportional to how long the resources are held. Thus, it is tricky to find the setting that minimizes the financial cost. With additional constraints such as a deadline or a financial budget for finishing the job, the problem becomes more complicated. More generally, the energy consumption of a MapReduce program can be modeled in a similar way, which is critical to energy-efficient cloud computing [2].
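To make this tradeoff concrete, consider a simple illustration (the notation here is ours, not part of the proposed model): let p be the price of one virtual node per unit time, n the number of provisioned nodes, and T(n) the job's execution time on n nodes. The financial cost is then roughly

    C(n) = p · n · T(n)

Because T(n) typically decreases sublinearly in n while the factor n grows linearly, C(n) is generally not monotone in n, so neither the smallest nor the largest cluster is necessarily the cheapest; adding a deadline or a budget turns the choice of n into a constrained optimization problem.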
We propose a method to help users make resource-provisioning decisions for running MapReduce programs in public clouds. The method is based on a specialized MapReduce cost model with a number of model parameters that must be determined for each specific application. These parameters can be learned from test runs on a small number of virtual nodes with small test data. Based on the cost model and the estimated parameters, the user can find the optimal resource setting by solving certain optimization problems.
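To illustrate how such parameters might be learned, the following is a minimal sketch, assuming (our assumption, not the proposed model) a handful of illustrative nonlinear basis functions of the number of map slots m, reduce slots r, and input size M, with the weights fitted by ordinary least squares on timings from small test runs:

    import numpy as np

    def features(m, r, M):
        # Illustrative nonlinear basis functions of map slots m, reduce slots r,
        # and input size M; the actual model defines its own set of terms.
        return np.array([1.0, M / m, M / r, (M * np.log(M + 1.0)) / m])

    def fit_weights(runs):
        # runs: list of (m, r, M, observed_time) tuples from small-scale test runs.
        # Returns least-squares weights of the linear-in-parameters cost model.
        X = np.array([features(m, r, M) for m, r, M, _ in runs])
        y = np.array([t for _, _, _, t in runs])
        w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        return w

    def predict_time(w, m, r, M):
        # Predicted job time for a candidate provisioning (m, r) and input size M.
        return float(features(m, r, M) @ w)

    # Hypothetical timings from four small test runs:
    # (map slots, reduce slots, input size in GB, observed time in minutes).
    runs = [(4, 2, 1.0, 120.0), (8, 4, 1.0, 70.0),
            (8, 4, 2.0, 130.0), (16, 8, 2.0, 80.0)]
    w = fit_weights(runs)
    print(predict_time(w, 32, 16, 10.0))  # extrapolate to a larger cluster and input

The linear-in-parameters form is what makes this cheap in practice: the weights can be estimated from a few small-scale runs and then used to predict costs for larger inputs and clusters.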
Our approach has several unique contributions:

•  Different from existing work on the performance analysis of MapReduce programs, our approach focuses on the relationship among the critical variables: the number of Map/Reduce slots, the amount of input data, and the complexity of the application-specific components. The resulting cost model can be represented as a weighted linear combination of a set of nonlinear functions of these variables. Linear models provide robust generalization power, which allows the weights to be determined from data collected in small-scale tests.
•  Based on this cost model, we formulate the important decision problems as several optimization problems. The resource requirement is mapped to the number of map/reduce slots; the financial cost of provisioning resources is the product of the cost function and the acquired map/reduce slots. With the explicit cost model, the resulting optimization problems are easy to formulate and solve (a minimal sketch of one such formulation follows this list).
•  We have conducted a set of experiments on both a local Hadoop cluster and Amazon EC2 to validate the cost model. The experiments show that the model fits the data collected from four tested MapReduce programs very well. The experiment on model prediction also shows low error rates.
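As a hedged sketch of one such formulation (the per-slot pricing, the slot ranges, and the reuse of predict_time and w from the earlier sketch are our own assumptions for illustration), minimizing financial cost subject to a deadline can be approached as a simple search over candidate slot counts:

    def minimize_cost(w, M, deadline, price_per_slot_time=0.1, max_slots=64):
        # Choose (m, r) minimizing cost = price * (m + r) * predicted_time,
        # subject to predicted_time <= deadline. Exhaustive search for
        # illustration; with an explicit cost model, a solver or a closed-form
        # solution could be used instead.
        best = None
        for m in range(1, max_slots + 1):
            for r in range(1, max_slots + 1):
                t = predict_time(w, m, r, M)  # same time units as the test runs
                if t > deadline:
                    continue
                cost = price_per_slot_time * (m + r) * t
                if best is None or cost < best[0]:
                    best = (cost, m, r, t)
        return best  # (cost, map slots, reduce slots, predicted time), or None

    print(minimize_cost(w, M=10.0, deadline=60.0))

Swapping the roles of the deadline and the budget gives the dual problem (minimize time subject to a spending cap) in the same way.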