Toward Optimal Resource Provisioning for Economical and Green MapReduce Computing in the Cloud - Large Scale and Big Data: Processing and Management


TABLE 17.1
Results of Regression Analysis for the In-House Cluster and AWS Clusters

      WordCount         Sort              PageRank          TableJoin
      Local    AWS      Local    AWS      Local    AWS      Local    AWS

51.82

20.55

25.89

37.73

47.53

3.61

β 0

28.32

54.30

0.72

21.74

12.24

10.37

12.27

20.07

β 1

0.01

0.18

β 2

9.24

14.75

β 3

4.09

3.58

6.58

1.60

3.01

β 4

26.79

β 5

0.10

0.59

0.05

0.51

0.19

β 6

0.38

β 7

R²    0.9751   0.9524   0.9692   0.9253   0.9847   0.9733   0.9647   0.8432

Note: R² values higher than 0.90 indicate a good fit of the proposed model.
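As a brief illustration of the goodness-of-fit measure reported in the table, the following sketch computes R² under its standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are hypothetical and are not taken from the chapter.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot (standard definition)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left unexplained by the model.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: spread of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical running times in seconds, for illustration only.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A model that reproduces every sample gives R² = 1, while a model that always predicts the sample mean gives R² = 0.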

good accuracy, which may imply the run-time environment is the main reason. The

cause of the problem will be further studied in our future work.

17.6.3.2 Prediction Accuracy

We also analyze the prediction accuracy of the models. Leave-one-out cross-validation [6] is used to determine the average prediction accuracy and to identify outliers with low accuracy. Concretely, with n training samples, leave-one-out cross-validation runs in n rounds. In each round, one of the n samples is held out for testing, while the other n − 1 samples are used for training.
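The leave-one-out procedure can be sketched as follows. This is a minimal illustration, not the chapter's implementation: it assumes a single-feature linear model y = b0 + b1·x fitted by ordinary least squares, whereas the chapter's cost models use several coefficients. The names `fit_line` and `leave_one_out` are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = b0 + b1 * x (one feature, an assumption)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def leave_one_out(xs: Sequence[float], ys: Sequence[float]) -> List[Tuple[float, float]]:
    """Run n rounds; round i trains on all samples except i and tests on i.

    Returns one (actual, predicted) pair per round.
    """
    xs, ys = list(xs), list(ys)
    pairs = []
    for i in range(len(xs)):
        # Hold out sample i and train on the remaining n - 1 samples.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        b0, b1 = fit_line(train_x, train_y)
        pairs.append((ys[i], b0 + b1 * xs[i]))
    return pairs


# Hypothetical, exactly linear data (y = 2x + 1), for illustration only.
pairs = leave_one_out([1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 5.0, 7.0, 9.0, 11.0])
```

Each returned pair corresponds to one point in an actual-versus-predicted plot, so pairs with a large gap between the two values are the low-accuracy outliers.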

Figures 17.3 and 17.4 compare the actual running time with the predicted running time for each sample case. The x-axis represents the actual running time, and the y-axis the predicted time. In the ideal case, all points lie on the line y = x, which is shown as the solid line. The points in these figures lie very close to the ideal line, indicating excellent prediction accuracy.

We define the average accuracy as the average relative error (ARE) over the n rounds of testing in the cross-validation. Let C_i be the real cost and Ĉ_i be the cost estimated by the trained model in round i. We calculate ARE with the following equation.

\[
\mathrm{ARE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert C_i - \hat{C}_i \rvert}{C_i} \qquad (17.19)
\]

Intuitively, this represents the percentage of prediction error relative to the actual execution time. Table 17.2 shows the AREs in leave-one-out cross-validation. The results confirm that most models are robust and perform well. However, certain models, such as PageRank in the local cluster, are less accurate than the others. A further detailed study will be performed to understand the factors affecting the modeling.
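To make the ARE metric concrete, here is a small sketch. It assumes ARE is the mean over the n rounds of |C_i − Ĉ_i| / C_i, with C_i the real cost and Ĉ_i the estimated cost. The function name `average_relative_error` and the sample costs are hypothetical.

```python
from typing import Sequence


def average_relative_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of |C_i - C_hat_i| / C_i over all rounds (assumed form of ARE)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(c == 0 for c in actual):
        raise ValueError("actual costs must be non-zero")
    total = sum(abs(c - c_hat) / c for c, c_hat in zip(actual, predicted))
    return total / len(actual)


# Hypothetical costs in seconds: each prediction is off by 10% of the actual value.
are = average_relative_error([100.0, 200.0], [110.0, 180.0])
```

Because each error is divided by the actual cost, over- and under-estimates of the same relative size contribute equally, and long-running jobs do not dominate the average.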
