Ensemble Learning of Run-Time Prediction
Models for Data-Intensive Scientific Workflows
David A. Monge 1,2, Matej Holec 3, Filip Zelezny 3, and Carlos García Garino 1,4
1 ITIC Research Institute, National University of Cuyo (UNCuyo), Argentina
2 Faculty of Exact and Natural Sciences, UNCuyo, Argentina
3 IDA Research Group, Czech Technical University, Czech Republic
4 Faculty of Engineering, UNCuyo, Argentina
{dmonge,cgarcia}@itu.uncu.edu.ar,
{holecmat,zelezny}@fel.cvut.cz
Abstract. Workflow applications for in-silico experimentation involve
the processing of large amounts of data. One of the core issues for the
efficient management of such applications is the prediction of task
performance. This paper proposes a novel approach that enables the
construction of models for predicting the running times of tasks in
data-intensive scientific workflows. Ensemble Machine Learning techniques
are used to produce robust combined models with high predictive accuracy.
Information derived from workflow systems and the characteristics and
provenance of the data are exploited to guarantee the accuracy of the models.
The proposed approach has been tested on Bioinformatics workflows for
Gene Expression Analysis on homogeneous and heterogeneous computing
environments. The results obtained highlight the benefit of using
ensemble models in comparison with single/standalone prediction models.
Ensemble learning techniques reduced the prediction error by up to 24.9%
in comparison with single-model strategies.
Keywords: Performance prediction, Scientific workflows, Ensemble
Learning, Data Provenance, Data-intensive computing.
1 Introduction
Workflow technology is intended to ease the development of applications through
the combination of reusable software components. This approach facilitates the
development of large-scale applications by people with little or no programming
experience. For this reason, workflow technology has been widely adopted in
many scientific areas [13].
Scientific data-intensive computing is in vogue nowadays [6]. In this context,
workflows are used to describe large-scale applications, whose execution is
delegated to Workflow Management Systems (WMSs) that take care of the details
of the underlying computing infrastructure. This aspect is very important for
executing large-scale applications because users can take advantage of huge
computing power (e.g., clusters, grids or clouds) while being abstracted from
the particularities of the underlying infrastructure.