computing environments. Thus, the surveyed techniques were developed with
compute-intensive applications in mind, disregarding important sources of
information such as the size or structure of the data, to say nothing of data
provenance [4] (i.e. the origin of the data and the transformations it undergoes
during the execution of an application). In the context of scientific workflows
(where data is becoming a first-class citizen [6,10]), this information is
fundamental for achieving accurate performance predictions.
A second point worth noting is that these strategies rely on a single model to
make their predictions. It is known that combining multiple models usually
yields better performance than using a single model [17].
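The idea of combining models can be illustrated with a minimal sketch. It is not the method of any surveyed technique: the two member models (a least-squares line and a nearest-neighbour lookup), the averaging rule, and the sample observations are all assumptions chosen for illustration. Averaging has one property that holds in general: the squared error of the averaged prediction is never larger than the mean squared error of the individual members.

```python
from statistics import mean

def linear_fit(xs, ys):
    """Least-squares line y = a*x + b on one feature (e.g. input size)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    return lambda x: a * x + b

def nearest_neighbour(xs, ys):
    """Return the run time of the closest previously observed input size."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def average_ensemble(models):
    """Combine member models by averaging their predictions."""
    return lambda x: mean(m(x) for m in models)

def squared_error(model, x, y):
    return (model(x) - y) ** 2

# Hypothetical observations: input size (MB) -> run time (s).
sizes = [10, 20, 40, 80, 160]
times = [1.2, 2.1, 4.3, 7.9, 16.5]

members = [linear_fit(sizes, times), nearest_neighbour(sizes, times)]
ensemble = average_ensemble(members)

# Hypothetical held-out observation used to compare the errors.
x_new, y_new = 100, 10.0
ensemble_error = squared_error(ensemble, x_new, y_new)
mean_member_error = mean(squared_error(m, x_new, y_new) for m in members)
```

Here `ensemble_error` is bounded above by `mean_member_error` because the squared error is convex. This guarantees only that the average is no worse than the typical member, not that it beats the best one.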
The following list describes the main limitations of the reviewed techniques in
the light of data-intensive scientific workflows:
- Disregard of data provenance information. Attributes of the data are a
central source of information for achieving high-quality predictions of task
performance.
- Use of standalone models. The techniques reviewed in this section rely on a
single model for predicting the running time of tasks.
Our contribution. This paper proposes a novel method for minimizing the
intervention of a human expert in modeling the performance of tasks in the
context of scientific workflows. The proposed method relies on ensemble
machine learning methods for generating models in an automatic fashion.
The proposed strategy incorporates several sources of information provided
by the underlying WMS, such as task parameters, hardware information,
data characteristics and provenance information, to maximize the
accuracy of the models.
3 Learning Performance Models
This section describes a novel generic strategy for the autonomous generation
of performance models (AGPM) for predicting the run time of workflow tasks.
Unlike other strategies, AGPM relies only on information that is accessible
from the underlying workflow system. The user only needs to define the
metadata of tasks that might be important for modeling their performance. In
this way, the process of performance modeling focuses on the parameters and
data that affect performance (the user's empirical knowledge) and not on the
particular process implemented by the tasks. This is one of the main advantages
of AGPM: because it treats tasks as black boxes, it permits the modeling
of legacy applications or web/grid services (i.e. software components whose code
is unavailable or inaccessible).
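The black-box view can be sketched as follows. This is an illustrative assumption about how user-declared metadata could drive feature extraction, not AGPM's actual interface: the task name, the field names, the record layout, and `extract_features` are all hypothetical.

```python
# Hypothetical metadata declaration: for each task, the user lists the fields
# that may influence its run time. All names here are illustrative only.
TASK_METADATA = {
    "align_sequences": {
        "parameters": ["num_iterations"],
        "data": ["input_size_mb", "num_records"],
        "provenance": ["num_upstream_transformations"],
        "hardware": ["cpu_benchmark_score"],
    }
}

def extract_features(task_name, record, metadata=TASK_METADATA):
    """Build a flat feature dict from a workflow-system record.

    Only the declared fields are read; the code of the task is never
    inspected, so the task remains a black box.
    """
    features = {}
    for source, fields in metadata[task_name].items():
        for field in fields:
            features[f"{source}.{field}"] = float(record[source][field])
    return features

# Hypothetical record, as a workflow system might expose it for one execution.
record = {
    "parameters": {"num_iterations": 50, "verbose": True},
    "data": {"input_size_mb": 120.5, "num_records": 30000},
    "provenance": {"num_upstream_transformations": 3},
    "hardware": {"cpu_benchmark_score": 812.0, "hostname": "node-07"},
}

features = extract_features("align_sequences", record)
```

Fields that the user did not declare (`verbose` and `hostname` in this sketch) are ignored, so the user's empirical knowledge decides which inputs the model sees.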
AGPM uses machine learning (ML) techniques to model the running time of tasks
using information about workflow task parameters, data and dependencies, as well
as resource benchmark metrics. ML methods permit the construction of the
models and their readjustment as new performance data becomes available. In
this manner, the human effort required to maintain the models is greatly reduced.
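Readjusting a model as new performance data arrives can be sketched as follows. The one-feature least-squares model is a deliberately simple stand-in and not the learner used by AGPM; the class name, the single feature (input size) and the sample observations are assumptions made for illustration.

```python
class IncrementalLinearModel:
    """One-feature least-squares model kept up to date from running sums."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        """Fold one new (feature, run time) observation into the model."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def predict(self, x):
        """Predict a run time; use the mean when the slope is undefined."""
        if self.n == 0:
            raise ValueError("no observations yet")
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            return self.sy / self.n
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope * x + intercept

# Hypothetical history: input size (MB) -> run time (s), following y = 2x + 1.
model = IncrementalLinearModel()
for size_mb, seconds in [(10, 21.0), (20, 41.0), (40, 81.0)]:
    model.update(size_mb, seconds)

before = model.predict(80)

# A new execution is observed and the model is readjusted without any
# manual intervention or refit over the stored history.
model.update(80, 200.0)
after = model.predict(80)
```

Because the model stores only running sums, each new observation is absorbed in constant time, and `after` differs from `before` as soon as the new measurement departs from the earlier trend.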
 