a workflow responsible for starting and monitoring jobs, both eventualities
mean that there is a set of successfully executed jobs (whose results should be
salvaged) and a set of not yet executed jobs that cannot be started, either because they depend on a failed task or because the workflow (engine) is no longer running and thus cannot start them. Once the workflow is restarted, a previously failed task can resume.
The tasks comprising the simulation monitoring workflow represent opera-
tions carried out on individual data items (files) as the simulation produces
data at each timestep. If some operation during a particular timestep fails
(e.g., transfer to another host fails, mass storage is down at archiving time,
or a statistic cannot be created — all common failures outside the control
of the workflow engine), this should not prevent the workflow from invoking
the complete pipeline of operations over the data produced during the next
timestep. However, because a downstream actor may be affected by the re-
sult of an upstream actor, actors should be prepared for such failures. Two
possible solutions are to (1) discard the token corresponding to the failed operation from the token stream, or (2) introduce special “failure tokens” to mark jobs that did not succeed. If we discard the token for the failed operation, downstream actors never receive a bad task request, and therefore no change to the actors is required. However, the absence of tokens changes the balance between the consumption and production rates of the actors, which can lead to difficulties in complex workflow designs, for example when pipelines must be split and merged. If we replace the token with
a failure token, and downstream actors are programmed to simply ignore such
failure tokens, the workflow structure remains simpler. *
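As a rough illustration of approach (2), the Python sketch below replaces the output of a failed operation with a failure token that downstream actors pass along unchanged, so that one output token is still produced per input token. The actor names (transfer, archive) and the FailureToken class are illustrative assumptions, not part of the CPES workflow or of Kepler.

```python
from dataclasses import dataclass

@dataclass
class FailureToken:
    """Marks a job that did not succeed; carries the cause for diagnostics."""
    stage: str
    reason: str

def transfer(path):
    """Hypothetical upstream actor: returns the transferred path, or a
    FailureToken instead of silently dropping the token."""
    try:
        # stand-in for the real transfer call (e.g., copy to another host)
        if path.endswith(".bad"):
            raise IOError("destination host unreachable")
        return path
    except IOError as err:
        return FailureToken("transfer", str(err))

def archive(token):
    """Hypothetical downstream actor: ignores failure tokens, so token
    consumption and production rates stay balanced."""
    if isinstance(token, FailureToken):
        return token                     # pass the failure along unchanged
    return f"archived:{token}"           # stand-in for the archiving step

# One pipeline invocation per file produced at each timestep:
for f in ["ts0001.dat", "ts0002.bad", "ts0003.dat"]:
    print(archive(transfer(f)))
```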
Resuming Workflow Execution Following a Fault. Pipelined (e.g., PN) workflows are harder to restart than DAG workflows because the current
state of the workflow is not as easy to describe and restore; all actors in
the workflow graph may be concurrently executing. While the progress of
executing a conventional DAG workflow (Section 13.2.3) can be seen as a single
“wavefront” progressing from the beginning of the workflow DAG toward the
end, in a pipeline-parallel workflow each task can be invoked repeatedly. If
the workflow system does not support full restoration of the workflow and
actor state (a nearly impossible task when dealing with workflow components
outside the control of the engine), the workflow itself has to include some sort
of lightweight checkpoint and restart capability.
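The sketch below shows one minimal way such a lightweight checkpoint can work: each successful operation is appended to a log, and a restarted run skips anything already logged. The log file name and the helper functions are hypothetical, introduced only to illustrate the idea; they are not the actual CPES mechanism described next.

```python
import os

LOG = "completed_ops.log"        # hypothetical checkpoint file

def load_completed(log=LOG):
    """Read the set of operations already recorded as successful."""
    if not os.path.exists(log):
        return set()
    with open(log) as fh:
        return {line.strip() for line in fh}

def run_once(op_id, action, completed, log=LOG):
    """Run an operation only if it is not already logged; on success,
    append it to the log so a restarted workflow can skip it."""
    if op_id in completed:
        return                           # already done before the fault
    action()
    with open(log, "a") as fh:
        fh.write(op_id + "\n")
    completed.add(op_id)

# Re-running the same pipeline after a crash repeats no completed work.
completed = load_completed()
for step in ["ts0001:transfer", "ts0001:archive", "ts0002:transfer"]:
    run_once(step, lambda s=step: print("executing", s), completed)
```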
In the CPES workflow, the solution is to have the remote execution actor (used for executing all of the actual data processing operations along the pipeline) record all successful operations [26]. When restarted, for example,
* The first design of the CPES workflows was based on approach (1); for the reason above, an improved design employed approach (2). The COMAD model of computation (see Section 13.2.3 and [22]) natively supports mechanisms to tag data, which is an elegant way to achieve variant (2); it can be used to skip over or even bypass data around actors [27], or to perform other forms of exception handling based on tags.