k-Nearest Neighbors (k-NN). In this strategy, training examples are stored verbatim. A distance function is used to determine the $k$ examples of the training set that are closest to an unknown test example [17]. The output of the method is the average of the target values (i.e., running time) corresponding to those $k$ nearest examples. In this study we use the Euclidean distance. Given two examples $x^{(1)}$ and $x^{(2)}$ with components $x^{(1)}_1, x^{(1)}_2, \ldots, x^{(1)}_n$ and $x^{(2)}_1, x^{(2)}_2, \ldots, x^{(2)}_n$ respectively, the Euclidean distance between them is
\[
d(x^{(1)}, x^{(2)}) = \sqrt{\sum_{i=1}^{n} \left(x^{(1)}_i - x^{(2)}_i\right)^2}.
\]
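The strategy above can be sketched in a few lines of NumPy; the function name and the toy data are ours, not from the study:

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict a numeric target (e.g. running time) as the average of the
    targets of the k training examples closest to x_test, using the
    Euclidean distance defined above."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k closest examples
    return y_train[nearest].mean()           # average of their target values

# Toy example: the two points near the origin are the 2 nearest neighbors.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
y = np.array([1.0, 3.0, 100.0])
pred = knn_predict(X, y, np.array([0.5, 0.0]), k=2)   # average of 1.0 and 3.0
```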
Support Vector Regression (SVR). SVR [17] is an adaptation of Support Vector Machines (SVM) to the prediction of numeric classes. The produced models can be expressed in terms of a few support vectors that best describe the prediction surface. The SVR model has the form
\[
f(x) = \sum_{i \in SV} \alpha_i K(x^{(i)}, x) + b,
\]
where $SV$ is the set of support vectors and $K(x^{(i)}, x)$ is a kernel function that maps an example into a feature space of higher dimension. $\alpha_i$ and $b$ are model parameters determined by solving the following optimization problem:
\[
\min_{\alpha, b, \xi_i, \xi_i^*} \; \frac{1}{2}\alpha^T \alpha + C \sum_{i=1}^{l} (\xi_i + \xi_i^*)
\quad \text{subject to:} \quad
\begin{cases}
y_i - f(x_i) \le \epsilon + \xi_i \\
f(x_i) - y_i \le \epsilon + \xi_i^* \\
\xi_i, \xi_i^* \ge 0
\end{cases}
\]
where $C$ is the model complexity parameter that penalizes training errors, and $\xi_i$ and $\xi_i^*$ are slack variables that specify upper- and lower-bound training errors beyond an error tolerance $\epsilon$. To model non-linear functions of the running time we use a radial basis function (RBF) kernel
\[
k(x^{(i)}, x^{(j)}) = \exp\left(-\gamma \left\|x^{(i)} - x^{(j)}\right\|^2\right)
\]
with $\gamma = 0.01$. The remaining SVR parameters were set to $C = 1$ and $\epsilon = 0.001$, which are the default values in Weka.
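The prediction form $f(x) = \sum_{i \in SV} \alpha_i K(x^{(i)}, x) + b$ can be sketched directly in NumPy. The support vectors and coefficients below are hypothetical placeholders; in practice they result from solving the optimization problem above (e.g. with Weka, as in the study):

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=0.01):
    """RBF kernel k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    diff = xi - xj
    return np.exp(-gamma * (diff @ diff))

def svr_predict(x, support_vectors, alphas, b, gamma=0.01):
    """SVR prediction f(x) = sum over support vectors of alpha_i * K(x_i, x) + b."""
    return sum(a * rbf_kernel(sv, x, gamma)
               for a, sv in zip(alphas, support_vectors)) + b

# Hypothetical fitted parameters, for illustration only.
svs = np.array([[0.0, 0.0], [1.0, 1.0]])
alphas = np.array([0.5, -0.2])
b = 1.0
value = svr_predict(np.array([0.0, 0.0]), svs, alphas, b)
```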
M5P regression trees. The M5 Prime (M5P) algorithm [17] induces decision trees whose leaves are associated with regression models; M5P trees thus combine decision trees with linear models at the leaves. The model is constructed in three phases. First, a decision tree is built that minimizes the variability of the output variable (i.e., examples with very similar running times are grouped together). Then, the tree is pruned starting from the leaf nodes; if an inner node is pruned, it is converted into a leaf node and a regression plane is associated with it. Finally, a smoothing function is applied to avoid sharp discontinuities between the hyperplanes. An example of an M5P regression tree is given in Figure 3b.
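The first phase, splitting so as to minimize the variability of the output variable, can be sketched for a single feature as follows (a simplified illustration, not the full M5P procedure):

```python
import numpy as np

def best_split(x, y):
    """Choose the threshold on feature x that minimizes the summed
    within-group variance of the target y in the two resulting groups,
    so that examples with similar running times are grouped together."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:              # candidate split thresholds
        left, right = y[x <= t], y[x > t]
        score = left.size * left.var() + right.size * right.var()
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Running times cluster below/above x = 2, so the split lands there.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 11.0, 50.0, 52.0])
threshold = best_split(x, y)
```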
Preprocessing. Before constructing the models, the available performance data is normalized to avoid the dominance of features with higher orders of magnitude in the construction of the models. To this end, each feature $x_i$ is transformed into a new feature $x'_i$ according to the equation
\[
x'_i = \frac{x_i - \mu}{\sigma},
\]
where $\mu$ and $\sigma$ are the mean and standard deviation of feature $x_i$ over the training data.
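This z-score normalization is a one-liner per feature in NumPy; the function name is ours:

```python
import numpy as np

def zscore(X):
    """Normalize each feature (column) of X to zero mean and unit standard
    deviation, so that features with larger magnitudes do not dominate
    model construction."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

# Second feature is three orders of magnitude larger; after normalization
# both features are on the same scale.
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
Xn = zscore(X)
```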