Ensemble Learning of Run-Time Prediction Models for Data-Intensive Scientific Workflows - High Performance Computing

Stage 4: Task Run-Time Prediction. This stage generates running-time estimates for tasks using the models constructed in the previous stage. The estimates are computed from the inputs of the workflow tasks (i.e. parameters and data) and the characteristics of the resources that will eventually execute those tasks.

This sequence of stages is repeated continuously throughout the execution of several applications. Each cycle allows the predictive accuracy of the models to improve and lets the models adapt to new (unseen) execution examples. The important point is that this adaptive learning process improves the accuracy of the prediction models without human intervention beyond the initial setup of the performance data to collect. Ensemble learning plays a central role in this objective because it autonomously provides the strategy with very robust models.
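The collect, retrain and predict cycle described above can be illustrated with a minimal sketch. It assumes a single feature (input size) and a bagging ensemble of least-squares lines; the class `BaggedRuntimeModel`, its methods and its parameters are hypothetical names chosen for illustration, not the learners used by the authors.

```python
import random


def fit_line(points):
    """Least-squares fit y = a + b*x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:  # all x values equal: fall back to a constant model
        return mean_y, 0.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return mean_y - slope * mean_x, slope


class BaggedRuntimeModel:
    """Hypothetical ensemble: averages lines fitted on bootstrap resamples."""

    def __init__(self, n_members=25, seed=0):
        self.n_members = n_members
        self.rng = random.Random(seed)
        self.history = []  # every (input_size, runtime) example seen so far
        self.members = []  # one (intercept, slope) pair per ensemble member

    def observe(self, input_size, runtime):
        """Record a finished execution and retrain: one learning cycle."""
        self.history.append((input_size, runtime))
        self.members = [
            fit_line(self.rng.choices(self.history, k=len(self.history)))
            for _ in range(self.n_members)
        ]

    def predict(self, input_size):
        """Average the members' estimates; call only after observe()."""
        estimates = [a + b * input_size for a, b in self.members]
        return sum(estimates) / len(estimates)


# Usage: each completed execution is fed back, with no manual tuning.
model = BaggedRuntimeModel()
for size, seconds in [(1, 5.1), (2, 7.9), (4, 14.2), (8, 26.0)]:
    model.observe(size, seconds)
estimate = model.predict(6)
```

Retraining on every observation keeps the sketch short; a real system would more plausibly retrain in batches, once per cycle of stages.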

3.2 Performance-Data Representation

Performance data is stored separately for each type of task. The performance dataset for a task can be formally defined as a set $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, where $x^{(i)}$ represents a column vector of features for the $i$-th (out of $m$) recorded execution example of a task, and $y^{(i)}$ is the measured running time for that execution, also known as the target.

Each feature vector $x = [x_1, x_2, \ldots, x_n]$ comprises three types of elements:

(i) task features, which represent the inputs of the task, e.g. parameter values, data size, etc.; (ii) provenance features, which describe previous processes that generated or modified the input data; and (iii) resource features, which model characteristics of the resource used in the execution of the task.
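As a rough illustration of this representation, the sketch below assembles one feature vector from the three groups and builds the dataset as a list of (x, y) pairs. The `Execution` class and all feature names (`input_bytes`, `n_upstream_tasks`, `cpu_benchmark`, and so on) are assumptions made for the example; the text does not prescribe a concrete schema.

```python
from dataclasses import dataclass


@dataclass
class Execution:
    """One recorded execution of a task (field names are illustrative)."""
    task: dict        # task features, e.g. {"input_bytes": 1200000, "threshold": 0.5}
    provenance: dict  # provenance features, e.g. {"n_upstream_tasks": 3}
    resource: dict    # resource features, e.g. {"cpu_benchmark": 8.1}
    runtime: float    # measured running time y, here assumed to be in seconds


def to_vector(execution, feature_names):
    """Concatenate the three groups into x = [x_1, ..., x_n].

    Assumes feature names are unique across the three groups; on a clash,
    the later group would silently overwrite the earlier one.
    """
    merged = {**execution.task, **execution.provenance, **execution.resource}
    return [float(merged[name]) for name in feature_names]


def build_dataset(executions, feature_names):
    """Return D as a list of (x, y) pairs, one per recorded execution."""
    return [(to_vector(e, feature_names), e.runtime) for e in executions]
```

Since performance data is stored separately per task type, a caller would keep one such dataset per type, for example in a dictionary keyed by the task name.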

···

Task features. Features of this kind describe the task's inputs. They include the values taken by the input parameters and characteristics of the data, such as its size or its number of lines, records or columns.
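A minimal sketch of how such task features might be collected is given below, assuming the input data is a delimited text file and that all parameters are numeric. The function names and the `param_` prefix are hypothetical.

```python
import csv
import os


def data_features(path, delimiter=","):
    """Size in bytes, row count and column count of a delimited text file."""
    n_rows = 0
    n_columns = 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            n_rows += 1
            n_columns = max(n_columns, len(row))
    return {
        "input_bytes": os.path.getsize(path),
        "n_rows": n_rows,
        "n_columns": n_columns,
    }


def task_features(parameters, input_path):
    """Combine numeric parameter values with characteristics of the data."""
    features = {f"param_{name}": float(value) for name, value in parameters.items()}
    features.update(data_features(input_path))
    return features
```

Scanning the whole file costs time proportional to its size; if that is too expensive, the size in bytes alone can be read without opening the file.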

Provenance features. Features of this type capture information about the origin of the data and the transformations applied to it by other tasks during the execution of the workflow. Such information can be easily extracted from the description of the workflow, and incorporating it yields more accurate performance models. As noted earlier, to the best of our knowledge, no other strategy in the state of the art uses such information to produce run-time predictions.
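To make the extraction from the workflow description concrete, the sketch below assumes the workflow is available as a mapping from each task to the tasks that produce its inputs. The encoding (two counts plus one flag per known upstream task) is one hypothetical choice; the text does not specify which provenance features are used.

```python
def ancestors(workflow, task):
    """All tasks that directly or indirectly feed `task` in the workflow DAG."""
    seen = set()
    stack = list(workflow.get(task, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(workflow.get(current, []))
    return seen


def provenance_features(workflow, task, known_tasks):
    """Hypothetical encoding: upstream counts plus one flag per known task."""
    upstream = ancestors(workflow, task)
    features = {
        "n_direct_parents": len(workflow.get(task, [])),
        "n_upstream_tasks": len(upstream),
    }
    for name in known_tasks:
        features[f"after_{name}"] = 1.0 if name in upstream else 0.0
    return features


# Usage: the input of "join" was produced by "sort" and "load";
# "sort" in turn consumes the output of "filter", which consumes "load".
workflow = {"filter": ["load"], "sort": ["filter"], "join": ["sort", "load"]}
join_features = provenance_features(workflow, "join", ["filter", "compress"])
```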

Resource features. Features of this kind describe the computing resources used in the execution of the tasks, and they can be obtained from the WMS. The features used for modeling the performance of an application are those that measure the performance of the resources (i.e. those that directly affect the performance of tasks). Such information is mainly provided by resource benchmarks. In general, most WMSs provide such metrics and update them regularly. Note that in the case of web services, features of this type are not accessible.
