Toward Optimal Resource Provisioning for Economical and Green MapReduce Computing in the Cloud - Large Scale and Big Data: Processing and Management


should be created and in the cost model. In the linear case, which we have observed to be common, the cost model can be further simplified to

$$
T_2(M, m, R) = \beta_0 + \beta_1 \frac{M}{m} + \beta_2 \frac{MR}{m} + \beta_3 \frac{m}{R} + \beta_4 \frac{M \log M}{R} + \beta_5 \frac{M}{R} + \beta_6 M + \beta_7 R. \tag{17.10}
$$
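As a minimal sketch of how the simplified model in Equation 17.10 can be evaluated, the function below computes the weighted sum of a constant term and the seven terms M/m, MR/m, m/R, (M log M)/R, M/R, M, and R. The names `t2_features` and `t2_cost` are hypothetical, and the natural logarithm is an assumption: the text does not state the base, and a different base only rescales the weight of the (M log M)/R term.

```python
import math


def t2_features(M, m, R):
    """Return the constant term followed by the seven terms of the simplified model.

    M: number of map tasks, m: map slots, R: reduce tasks (symbols as in the text).
    The natural logarithm is an assumption; the text does not state the base.
    """
    return [
        1.0,                      # multiplied by beta_0
        M / m,                    # beta_1
        M * R / m,                # beta_2
        m / R,                    # beta_3
        M * math.log(M) / R,      # beta_4
        M / R,                    # beta_5
        float(M),                 # beta_6
        float(R),                 # beta_7
    ]


def t2_cost(beta, M, m, R):
    """Evaluate the linear cost model for weights beta = [beta_0, ..., beta_7]."""
    feats = t2_features(M, m, R)
    if len(beta) != len(feats):
        raise ValueError("beta must have exactly 8 entries")
    return sum(b * x for b, x in zip(beta, feats))
```

Because the model is linear in the weights, once the eight features are computed the time estimate is a plain dot product, which is what makes regression a natural way to learn the weights.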

17.4 LEARNING THE MODEL

With the cost function formulated in terms of the input variables M, m, and R, we need to learn the parameters β_i. Note that the β_i differ from application to application. We design the learning procedure as follows.

First, for a specific MapReduce program, we randomly choose the variables M, m, and R from certain ranges. For example, m and R (i.e., r) are chosen within 50; M is chosen so that at least two rounds of map processes are available for testing. Second, we collect the time cost of the test run of the MapReduce job for each setting of (M, m, R), which forms the training data set. Third, regression modeling [14] is applied to learn the model from the training data with the transformed variables

$$
x_1 = \frac{M}{m},\quad x_2 = \frac{MR}{m},\quad x_3 = \frac{m}{R},\quad x_4 = \frac{M \log M}{R},\quad x_5 = \frac{M}{R},\quad x_6 = M,\quad x_7 = R. \tag{17.11}
$$

Because each β_i has a practical meaning, namely the weight of a component in the total cost, we require β_i ≥ 0 for i = 0, …, 7, which calls for non-negative linear regression [14] to solve the learning problem. The cross-validation method [6] is then used to validate the performance of the learned model. More details are given in the experiments.
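The learning procedure can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the chapter's implementation: the solver is a simple cyclic coordinate descent with clamping at zero, swapped in for whatever non-negative regression routine [14] describes; the training data come from a hypothetical `make_synthetic_runs` helper with invented weights and noise, standing in for measured job times; and all function names are made up for this example.

```python
import math
import random


def features(M, m, R):
    """Constant term plus the transformed variables x_1..x_7 (natural log assumed)."""
    return [1.0, M / m, M * R / m, m / R, M * math.log(M) / R, M / R, float(M), float(R)]


def predict(beta, row):
    return sum(b * x for b, x in zip(beta, row))


def fit_nonnegative(X, y, sweeps=200):
    """Approximately minimize ||X beta - y||^2 subject to beta >= 0.

    Cyclic coordinate descent: each step solves for one weight exactly and
    clamps it at zero, so the squared error never increases.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta for beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            grad = sum(X[i][j] * resid[i] for i in range(n))
            new = max(0.0, beta[j] + grad / col_sq[j])
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta


def cross_validate(X, y, k=5):
    """Mean relative prediction error over k held-out folds."""
    n = len(X)
    errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        X_train = [X[i] for i in range(n) if i not in held_out]
        y_train = [y[i] for i in range(n) if i not in held_out]
        beta = fit_nonnegative(X_train, y_train)
        for i in held_out:
            errors.append(abs(predict(beta, X[i]) - y[i]) / y[i])
    return sum(errors) / len(errors)


def make_synthetic_runs(n, seed=0):
    """Stand-in for real test runs: invented weights plus up to 5% noise."""
    rng = random.Random(seed)
    true_beta = [5.0, 2.0, 0.01, 0.5, 0.02, 0.3, 0.05, 0.1]  # made up for illustration
    X, y = [], []
    for _ in range(n):
        m = rng.randint(1, 50)
        R = rng.randint(1, 50)
        M = rng.randint(2 * m, 8 * m)  # at least two rounds of map tasks
        row = features(M, m, R)
        X.append(row)
        y.append(predict(true_beta, row) * (1.0 + rng.uniform(-0.05, 0.05)))
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_runs(60)
    print("learned weights:", [round(b, 4) for b in fit_nonnegative(X, y)])
    print("mean relative CV error:", round(cross_validate(X, y), 4))
```

Several of the transformed variables are strongly correlated (for instance M/R and (M log M)/R), so the individual weights recovered from a small sample may differ noticeably from the generating ones even when the predicted times are close. With measured data, a library solver for non-negative least squares would normally replace the hand-written loop.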

17.5 OPTIMIZATION OF RESOURCE PROVISIONING

With the cost model, we are now ready to find the optimal settings for different decision problems. We try to find the best resource allocation for three typical situations: (1) a limited financial budget; (2) a time constraint; and (3) the optimal tradeoff curve without any constraint. In the following, we formulate these as optimization problems based on the cost model.
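To make the first two situations concrete, the sketch below enumerates all (m, R) pairs on a small grid and picks the best feasible one. It rests on assumptions that are not taken from the text: the financial cost is taken to be a hypothetical `price` times the number of slots (m + R) times the predicted running time, the search is brute force rather than an analytical solution, and the function names are invented for this example.

```python
import math


def t2(beta, M, m, R):
    """Predicted running time under the simplified linear model (natural log assumed)."""
    x = [1.0, M / m, M * R / m, m / R, M * math.log(M) / R, M / R, float(M), float(R)]
    return sum(b * v for b, v in zip(beta, x))


def candidates(beta, M, max_slots, price):
    """Yield (m, R, time, cost) for every setting on the grid."""
    for m in range(1, max_slots + 1):
        for R in range(1, max_slots + 1):
            t = t2(beta, M, m, R)
            # Hypothetical billing rule: pay per slot per unit of time.
            yield m, R, t, price * (m + R) * t


def fastest_within_budget(beta, M, budget, max_slots=50, price=1.0):
    """Situation (1): minimize time subject to cost <= budget; None if infeasible."""
    feasible = [c for c in candidates(beta, M, max_slots, price) if c[3] <= budget]
    return min(feasible, key=lambda c: (c[2], c[3]), default=None)


def cheapest_within_deadline(beta, M, deadline, max_slots=50, price=1.0):
    """Situation (2): minimize cost subject to time <= deadline; None if infeasible."""
    feasible = [c for c in candidates(beta, M, max_slots, price) if c[2] <= deadline]
    return min(feasible, key=lambda c: (c[3], c[2]), default=None)
```

Exhaustive search is affordable here because M is fixed for a given job and the grid over m and R is small; ties are broken by the secondary criterion (lower cost for equal time, lower time for equal cost).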

In all the scenarios we consider, we assume the model parameters β_i have been learned from sample runs in small-scale settings. For simplicity of presentation, we assume the simplified model T_2 (Equation 17.10) is applied. Cost models with a different reduce complexity do not change the optimization algorithm. Since the input data is fixed for a specific MapReduce job, M is a constant. We also assume that all general MapReduce system configurations have been optimized via other methods [1,7,8] and are fixed for both small- and large-scale settings. With this setup, the time cost function becomes
