Ensemble Learning of Run-Time Prediction Models for Data-Intensive Scientific Workflows - High Performance Computing


For managing the applications efficiently, WMSs rely on run-time estimates of tasks. This information is the basis for several processes, for example task scheduling, fulfillment of Quality of Service (QoS) requirements, and autoscaling of cloud infrastructures, among others [3,5,9].

Most of the prediction methods used by WMSs were crafted for characterizing parallel applications. Although such techniques provide accurate predictions, they require the supervision of an expert for constructing and tuning the prediction models. Such requirements undermine one of the main advantages of workflow technology: simplicity for the user.

To cope with this limitation, many authors have applied Machine Learning strategies to generate the prediction models (semi-)automatically. Following this line of thought, we propose a novel method for the autonomous generation of multiple combined run-time prediction models derived using Ensemble Learning methods. The final objective of our approach is to minimize the human effort required to generate the models without compromising the accuracy of predictions. To accomplish this objective, this work uses the performance information available in WMSs and workflow provenance information to learn robust combined models.

The rest of this paper is organized as follows. In Section 2 we review performance prediction strategies based on Machine Learning methods. Section 3 presents the proposed approach for learning run-time prediction models. Section 4 describes a set of Bioinformatics workflows and the methodology used for validating our proposal. Section 5 presents and discusses the results obtained. Finally, conclusions and future work are given in Section 6.

2 Related Works

The prediction of application performance has been studied since the genesis of parallel and distributed computing [1]. Many of these strategies use historical data to carry out the predictions instead of constructing the models by hand. Statistical and Machine Learning techniques permit the derivation of models based on the available historical data (examples). This approach offers an important advantage for workflow applications executing on Grid or Cloud environments, because models can be refined over time and the user does not need to supervise the construction of the models or perform tedious tasks such as benchmarking resources, profiling applications, etc.

Some of these strategies address the prediction problem using the k-Nearest Neighbors method [8,11]. Predictions are performed by first looking for execution examples with settings similar to the prediction query (e.g. examples with similar task parameters, processor speed, etc.). Then, the execution times corresponding to the selected examples are averaged and returned as the prediction. Other authors use methods such as regression trees for predicting the performance of applications [12]. More recently, Artificial Neural Networks have been applied to estimate the price of market-based computing resources [15].
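
The k-Nearest Neighbors scheme described above can be illustrated with a minimal sketch. It is not the implementation used in [8,11]; the function name knn_predict_runtime, the two features (input size and processor speed), and the historical values are all hypothetical and serve only to show the "find similar examples, then average their run times" idea.

```python
import math

def knn_predict_runtime(history, query, k=3):
    """Average the run times of the k historical executions closest to the query.

    history: list of (feature_vector, run_time_seconds) pairs (hypothetical data).
    query:   feature vector of the task whose run time is to be predicted.
    """
    if not history:
        raise ValueError("history must contain at least one execution example")
    # Rank past executions by Euclidean distance between feature vectors.
    ranked = sorted(history, key=lambda example: math.dist(example[0], query))
    nearest = ranked[:k]
    # The prediction is the plain average of the selected run times.
    return sum(run_time for _, run_time in nearest) / len(nearest)

# Hypothetical examples: (input size in GB, processor speed in GHz) -> run time in s
history = [
    ((1.0, 2.0), 100.0),
    ((1.2, 2.0), 120.0),
    ((5.0, 2.0), 480.0),
    ((5.5, 3.0), 400.0),
    ((0.9, 3.0), 70.0),
]

prediction = knn_predict_runtime(history, (1.1, 2.0), k=2)
print(prediction)
```

In this sketch the features are used on their raw scales; in practice they would probably need to be normalized so that no single attribute dominates the distance.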

The strategies mentioned above apply statistical or machine-learning methods to predict several aspects of the execution of applications in the context of distributed
