Ensemble Learning of Run-Time Prediction Models for Data-Intensive Scientific Workflows - High Performance Computing


computing environments. Likewise, the surveyed techniques were developed with compute-intensive applications in mind, disregarding important information sources such as the size or the structure of the data, to say nothing of data provenance [4] (i.e., the origin of the data and the transformations applied to it during the execution of an application). In the context of scientific workflows, where data is becoming a first-class citizen [6,10], this information is fundamental for achieving accurate performance predictions.

A second point worth noting is that these strategies rely on a single model to make their predictions. It is known that combining multiple models usually achieves higher predictive performance than using a single model [17]. The following list describes the main limitations of the reviewed techniques in light of data-intensive scientific workflows:

- Disregard of data provenance information. Attributes of the data are a central source of information for achieving high-quality predictions of task performance.

- Use of standalone models. The techniques reviewed in this section rely on a single model for predicting the running time of tasks.
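To illustrate the second limitation, the following minimal sketch contrasts a standalone model with a combination of models. It uses bagging (bootstrap aggregation) of simple least-squares lines, which is only one generic ensemble technique and is not claimed to be the method used in this work. The run-time history, the function names and the single input feature (input size) are all assumptions made for illustration.

```python
import random
from statistics import mean


def fit_line(points):
    """Ordinary least squares for y = a + b*x on (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate bootstrap sample (one distinct x): use a constant model.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in points) / sxx
    return my - b * mx, b


def bagged_models(points, n_models=25, seed=0):
    """Fit one line per bootstrap resample of the history."""
    rng = random.Random(seed)
    return [
        fit_line([rng.choice(points) for _ in points])
        for _ in range(n_models)
    ]


def predict_ensemble(models, x):
    """Average the predictions of all member models."""
    return mean(a + b * x for a, b in models)


# Hypothetical history: (input size in GB, observed run-time in seconds).
history = [(1, 12.0), (2, 19.5), (4, 41.0), (8, 79.0), (16, 163.0)]

models = bagged_models(history)
estimate = predict_ensemble(models, 10)  # predicted run-time for a 10 GB input
```

Because the ensemble output is an average, it always lies between the most optimistic and the most pessimistic member prediction, which is what dampens the effect of any single poorly fitted model.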

Our contribution. This paper proposes a novel method for minimizing the intervention of a human expert in modeling the performance of tasks in the context of scientific workflows. The proposed method relies on ensemble Machine Learning methods for generating models in an automatic fashion. The proposed strategy incorporates several sources of information provided by the underlying WMS, such as task parameters, hardware information, data characteristics and provenance information, to maximize the accuracy of the models.

3 Learning Performance Models

This section describes a novel generic strategy for the autonomous generation of performance models (AGPM) for predicting the run-time of workflow tasks. Unlike other strategies, AGPM relies only on information that is accessible from the underlying workflow system. The user only needs to define the meta-data of tasks that might be important for modeling their performance. In this way, the process of performance modeling focuses on the parameters and data that affect the performance (the user's empirical knowledge) and not on the particular process implemented by the tasks. This is one of the main advantages of AGPM: because it considers tasks as black boxes, it permits the modeling of legacy applications or web/grid services (i.e., software components whose code is unavailable or inaccessible).
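The black-box view can be pictured as follows: a task is described only by the meta-data recorded for each execution, and the user selects which fields may matter for performance. This is a minimal sketch under assumed names; `TaskRecord`, `to_features` and every meta-data key below are hypothetical and do not come from AGPM or from any particular workflow system.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One past execution of a black-box task (hypothetical schema)."""
    task_name: str
    metadata: dict     # everything the workflow system recorded for the run
    runtime_s: float   # observed running time, i.e. the prediction target


def to_features(record, selected_keys):
    """Keep only the user-selected numeric meta-data, in a fixed order."""
    return [float(record.metadata[key]) for key in selected_keys]


# The user's empirical knowledge: which fields may influence the run-time.
selected = ["input_size_mb", "n_iterations", "cpu_benchmark"]

run = TaskRecord(
    task_name="align_reads",
    metadata={
        "input_size_mb": 512,
        "n_iterations": 10,
        "cpu_benchmark": 3.2,
        "user": "alice",  # recorded, but not selected as a feature
    },
    runtime_s=84.5,
)

row = to_features(run, selected)
```

Nothing in this representation depends on the task's source code, so the same feature extraction would apply to a legacy binary or a remote service.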

AGPM uses machine learning (ML) techniques to model the running time of tasks using information about workflow task parameters, data and dependencies, as well as resource benchmark metrics. ML methods permit the construction of the models and their readjustment as new performance data becomes available. In this manner, the human effort required to maintain the models is greatly reduced
