resources such as those managed by Portable Batch System (PBS), Load
Sharing Facility (LSF), Condor, 54 and individual machines. Authentication
to remote resources is done via Grid Security Infrastructure (GSI). 55 During
workflow execution, Pegasus captures provenance information about the exe-
cuted tasks. Provenance includes a variety of information including the hosts
where tasks executed, task runtimes, environment variables, etc. Pegasus uses
the DAGMan workflow engine for execution (Figure 13.3). DAGMan interfaces
in turn to a local Condor queue managed by a scheduler daemon. DAGMan
uses the scheduler's API and logs to submit, query, and manipulate jobs, and
does not directly interact with jobs. DAGMan can also use Condor's grid
abilities (Condor-G) to submit jobs to many other batch and grid systems.
DAGMan reads the logs of the underlying batch system to follow the status of
submitted jobs rather than invoking interactive tools or service APIs. By
relying on file-based I/O, DAGMan's implementation stays simple, and is
therefore more scalable, reliable, and robust across many platforms. For
example, if DAGMan has crashed while the underlying batch system contin-
ues to run jobs, DAGMan can recover its state upon restart (by reading logs
provided by the batch system) without losing information about the executing
workflow. DAGMan workflow management includes not only job submission
and monitoring but also job preparation, cleanup, throttling, retry, and other
actions necessary to ensure successful workflow execution. DAGMan attempts
to overcome or work around as many execution errors as possible; in the
face of errors it cannot overcome, it provides a Rescue DAG * and allows the
user to resolve the problem manually and then resume the workflow from
the point where it last left off. This can be thought of as a “checkpoint-
ing” of the workflow, just as some batch systems provide checkpointing of
jobs.
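The log-based recovery and Rescue-DAG-style resumption described above can be sketched in a few lines of Python. The event-log format, DAG encoding, and function names here are illustrative assumptions, not DAGMan's actual file formats:

```python
# Sketch of DAGMan-style crash recovery: rebuild workflow state from an
# append-only event log, then compute the jobs still to run (the idea
# behind a Rescue DAG). Log format and DAG encoding are hypothetical.

# A workflow DAG: job -> list of parent jobs that must finish first.
dag = {
    "prepare": [],
    "analyze": ["prepare"],
    "plot":    ["analyze"],
    "archive": ["analyze"],
}

def recover_done_jobs(log_lines):
    """Rebuild the set of completed jobs purely from the event log,
    as a restarted manager would after a crash."""
    done = set()
    for line in log_lines:
        event, job = line.split()
        if event == "TERMINATED":
            done.add(job)
    return done

def rescue_jobs(dag, done):
    """Jobs still to run, in an order respecting dependencies --
    analogous to resuming the workflow from a Rescue DAG."""
    remaining = {j for j in dag if j not in done}
    order = []
    while remaining:
        ready = [j for j in sorted(remaining)
                 if all(p in done for p in dag[j])]
        for j in ready:
            order.append(j)
            done.add(j)
            remaining.remove(j)
    return order

log = ["SUBMITTED prepare", "TERMINATED prepare", "SUBMITTED analyze"]
done = recover_done_jobs(log)
print(rescue_jobs(dag, done))  # → ['analyze', 'archive', 'plot']
```

Note that "analyze" was submitted but never terminated before the crash, so it is rerun, followed by its dependents.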
Triana supports job-level execution through GAT integration, which can
make use of job execution components such as GRMS, 52 GRAM 56 or Condor 37
for the actual job submission. It also supports service-level execution through
the GAP bindings to Web, WSRF, and P2P services. During execution, Triana
detects component failures and reports them to the user. However, Triana
does not provide failsafe mechanisms within the system, such as automatically
retrying a failed service.
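The pluggable-backend pattern behind GAT-style job submission can be sketched as follows. The interface and class names are hypothetical illustrations, not the real GAT or Triana APIs:

```python
# Sketch of GAT-style pluggable job submission: workflow code talks to
# one abstract interface, while interchangeable backends (Condor, GRAM,
# GRMS, ...) perform the actual submission. Names are hypothetical.
from abc import ABC, abstractmethod

class JobSubmitter(ABC):
    @abstractmethod
    def submit(self, executable, args):
        """Submit a job; return a backend-specific job identifier."""

class LocalSubmitter(JobSubmitter):
    def submit(self, executable, args):
        return f"local:{executable}"

class CondorSubmitter(JobSubmitter):
    def submit(self, executable, args):
        return f"condor:{executable}"

def run_component(submitter, executable, args):
    # As in Triana, a failure is reported to the user rather than
    # retried automatically by the system.
    try:
        return submitter.submit(executable, args)
    except Exception as exc:
        print(f"component failed: {exc}")
        return None

print(run_component(CondorSubmitter(), "render", ["frame1"]))  # → condor:render
```

Swapping `CondorSubmitter` for `LocalSubmitter` changes where the job runs without touching the workflow logic, which is the point of binding to an abstraction rather than a specific submission system.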
As discussed in Section 13.2.3, execution of a Kepler workflow is man-
aged through an independent component called a director, which is in charge
of workflow scheduling and execution. A director in Kepler encapsulates a
model of computation (MoC) and a scheduling algorithm, which allows the
same workflow to be executed in different ways depending on which workflow
director/MoC is used. Kepler ships with some common MoCs, such as SDF,
PN, and DDF (see Section 13.2.3 for more details).
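The director idea can be illustrated by running one actor graph under two different schedulers: a static precomputed schedule (in the spirit of SDF) and a dynamic, data-driven one (in the spirit of DDF). This is a simplified sketch with a hypothetical actor encoding, not Kepler's actual API:

```python
# Sketch of the director/MoC idea: the same actor graph executes under
# different models of computation depending on which director runs it.
# Actor encoding and director classes are hypothetical simplifications.

# Actors: name -> (input actor names, function producing one output token).
actors = {
    "source": ([], lambda: 3),
    "double": (["source"], lambda x: 2 * x),
    "square": (["double"], lambda x: x * x),
}

class SDFDirector:
    """SDF-like: compute a static firing schedule once, then execute it."""
    def run(self, actors):
        tokens = {}
        for name in self._static_schedule(actors):
            inputs, fn = actors[name]
            tokens[name] = fn(*(tokens[i] for i in inputs))
        return tokens

    def _static_schedule(self, actors):
        ordered, placed = [], set()
        while len(ordered) < len(actors):
            for name, (inputs, _) in sorted(actors.items()):
                if name not in placed and all(i in placed for i in inputs):
                    ordered.append(name)
                    placed.add(name)
        return ordered

class DDFDirector:
    """DDF-like: at runtime, fire any actor whose input tokens exist."""
    def run(self, actors):
        tokens = {}
        while len(tokens) < len(actors):
            for name, (inputs, fn) in actors.items():
                if name not in tokens and all(i in tokens for i in inputs):
                    tokens[name] = fn(*(tokens[i] for i in inputs))
        return tokens

# Same workflow, two directors, same result via different scheduling.
print(SDFDirector().run(actors)["square"])  # → 36
print(DDFDirector().run(actors)["square"])  # → 36
```

The workflow definition is untouched when the director changes; only the scheduling strategy differs, mirroring how Kepler separates a workflow from its model of computation.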
* http://www.cs.wisc.edu/condor/manual/v7.0/2_10DAGMan_Applications.html
or orchestration and choreography in Web service parlance