TABLE 17.2
Average Relative Error Rates of the Leave-One-Out Cross-Validation and of the Testing Result on Training Data for the Four Programs

           WordCount    Sort       PageRank    TableJoin
Local       5.49%       6.46%      15.23%      15.61%
AWS        12.18%       7.92%      13.57%      14.62%
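For reference, the leave-one-out errors reported above can be computed as sketched below. This is a minimal illustration, assuming the cost model is fitted to the measured runs by ordinary least squares; the feature matrix X, the observed times y, and the function names are placeholders rather than the chapter's exact formulation.

import numpy as np

def loocv_relative_error(X, y):
    """Average relative error of leave-one-out cross-validation for a
    least-squares model y ~ X @ beta.  X: (n, d) features built from the
    measured runs, y: (n,) observed job running times."""
    n = len(y)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i                      # hold out the i-th run
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        pred = X[i] @ beta                            # predict the held-out run
        errors.append(abs(pred - y[i]) / y[i])
    return np.mean(errors)

def training_relative_error(X, y):
    """Average relative error when testing on the training data itself."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean(np.abs(X @ beta - y) / y)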
17.7 RELATED WORK
Recent research on MapReduce has focused on understanding and improving the performance of MapReduce processing in dedicated private Hadoop clusters. The configuration parameters of a Hadoop cluster are investigated in [1,7,8] to find the optimal configuration for different types of jobs. In [21], the authors simulate the steps in MapReduce processing and explore the effects of network topology, data layout, and application I/O characteristics on performance. Job scheduling algorithms in multiuser, multijob environments are also studied in [19,23,24]. These studies have goals different from ours, but an optimal Hadoop configuration also reduces the resources and time required by jobs running in the public cloud. A theoretical study of the MapReduce programming model [12] characterizes the mixture of sequential and parallel processing in MapReduce, which justifies our analysis in Section 17.3.
MapReduce performance prediction has been another important topic. Kambatla et al. [10] studied the effect of the map and reduce slot settings on performance and observed that different MapReduce programs may have different CPU and I/O patterns. A fingerprint-based method is used to predict the performance of a new MapReduce program based on previously studied programs. Historical execution traces of MapReduce programs are also used for program profiling and performance prediction in [13]. For long MapReduce jobs, accurate progress indication is important; this problem is studied in [16]. A strategy used in [10,13], and shared by our approach, is to use test runs on small-scale settings to characterize the behavior of large-scale settings. However, these approaches do not derive an explicit cost function that can be used in optimization problems.
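To make this small-scale-to-large-scale strategy concrete, the sketch below fits an illustrative parametric running-time model from a few small test runs and extrapolates it to a larger configuration. The particular feature terms in M (input size), m (map slots), and r (reduce slots), the coefficient fit, and all function names are assumptions for illustration only, not the exact cost function derived in this chapter.

import numpy as np

def features(M, m, r):
    """Illustrative feature vector: per-map-slot work, per-reduce-slot
    work with a log factor for sorting, and a constant overhead term."""
    return np.array([M / m, (M / r) * np.log(M / r), 1.0])

def fit_cost_model(runs):
    """Fit model coefficients from small-scale test runs.
    runs: list of (M, m, r, observed_time) tuples."""
    X = np.array([features(M, m, r) for M, m, r, _ in runs])
    y = np.array([t for _, _, _, t in runs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def predict_time(beta, M, m, r):
    """Predict the running time of a (possibly much larger) configuration."""
    return float(features(M, m, r) @ beta)

# Example: fit on small test runs (made-up numbers), then predict a large job.
runs = [(8, 4, 2, 130.0), (16, 4, 2, 250.0), (16, 8, 4, 140.0),
        (32, 8, 4, 265.0), (32, 16, 8, 150.0)]
beta = fit_cost_model(runs)
print(predict_time(beta, 256, 32, 16))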
17.8 CONCLUSION
Running MapReduce programs in the public cloud raises an important problem: how do we optimize resource provisioning to minimize the financial cost for a specific job? To answer this question, we believe a fundamental step is to understand the relationship between the amount of resources and the job characteristics (e.g., input data and processing algorithm). In this chapter, we study the components of MapReduce processing and build a cost function that explicitly models the relationship between the amount of data, the available system resources (map and reduce slots), and the complexity of the reduce program for the target MapReduce program. The model