TABLE 17.2
Average Relative Error Rates of the Leave-One-Out Cross-Validation and of the Testing Result on Training Data for the Four Programs

           WordCount    Sort       PageRank    TableJoin
Local       5.49%       6.46%      15.23%      15.61%
AWS        12.18%       7.92%      13.57%      14.62%
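For reference, the leave-one-out errors reported above can be computed as sketched below. This is a minimal illustration, assuming the cost model is fitted to the measured runs by ordinary least squares; the feature matrix X, the observed times y, and the function names are placeholders rather than the chapter's exact formulation.

import numpy as np

def loocv_relative_error(X, y):
    """Average relative error of leave-one-out cross-validation for a
    least-squares model y ~ X @ beta.  X: (n, d) features built from the
    measured runs, y: (n,) observed job running times."""
    n = len(y)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i                      # hold out the i-th run
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        pred = X[i] @ beta                            # predict the held-out run
        errors.append(abs(pred - y[i]) / y[i])
    return np.mean(errors)

def training_relative_error(X, y):
    """Average relative error when testing on the training data itself."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean(np.abs(X @ beta - y) / y)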
17.7 RELATED WORK
Recent research on MapReduce has focused on understanding and improving the performance of MapReduce processing in dedicated private Hadoop clusters. The configuration parameters of a Hadoop cluster are investigated in [1,7,8] to find the optimal configuration for different types of jobs. In [21], the authors simulate the steps in MapReduce processing and explore the effects of network topology, data layout, and application I/O characteristics on performance. Job scheduling algorithms in multiuser, multijob environments are also studied in [19,23,24]. These studies have goals different from ours, but an optimal Hadoop configuration also reduces the resources and time required by jobs running in the public cloud. A theoretical study of the MapReduce programming model [12] characterizes the mixture of sequential and parallel processing in MapReduce, which justifies our analysis in Section 17.3.
MapReduce performance prediction has been another important topic. Kambatla et al. [10] studied the effect of the map and reduce slot settings on performance and observed that different MapReduce programs may have different CPU and I/O patterns. A fingerprint-based method is used to predict the performance of a new MapReduce program based on previously studied programs. Historical execution traces of MapReduce programs are also used for program profiling and performance prediction in [13]. For long MapReduce jobs, accurate progress indication is important; this problem is studied in [16]. A strategy used in [10,13], and shared by our approach, is to use test runs on small-scale settings to characterize the behavior of large-scale settings. However, these approaches do not derive an explicit cost function that can be used in optimization problems.
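To make this small-scale-to-large-scale strategy concrete, the sketch below fits an illustrative parametric running-time model from a few small test runs and extrapolates it to a larger configuration. The particular feature terms in M (input size), m (map slots), and r (reduce slots), the coefficient fit, and all function names are assumptions for illustration only, not the exact cost function derived in this chapter.

import numpy as np

def features(M, m, r):
    """Illustrative feature vector: per-map-slot work, per-reduce-slot
    work with a log factor for sorting, and a constant overhead term."""
    return np.array([M / m, (M / r) * np.log(M / r), 1.0])

def fit_cost_model(runs):
    """Fit model coefficients from small-scale test runs.
    runs: list of (M, m, r, observed_time) tuples."""
    X = np.array([features(M, m, r) for M, m, r, _ in runs])
    y = np.array([t for _, _, _, t in runs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def predict_time(beta, M, m, r):
    """Predict the running time of a (possibly much larger) configuration."""
    return float(features(M, m, r) @ beta)

# Example: fit on small test runs (made-up numbers), then predict a large job.
runs = [(8, 4, 2, 130.0), (16, 4, 2, 250.0), (16, 8, 4, 140.0),
        (32, 8, 4, 265.0), (32, 16, 8, 150.0)]
beta = fit_cost_model(runs)
print(predict_time(beta, 256, 32, 16))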
17.8 CONCLUSION
Running MapReduce programs in the public cloud raises an important problem: how do we optimize resource provisioning to minimize the financial cost for a specific job? To answer this question, we believe a fundamental step is to understand the relationship between the amount of resources and the job characteristics (e.g., input data and processing algorithm). In this chapter, we study the components of MapReduce processing and build a cost function that explicitly models the relationship between the amount of data, the available system resources (map and reduce slots), and the complexity of the reduce program for the target MapReduce program. The model