a workflow responsible for starting and monitoring jobs, both eventualities
mean that there is a set of successfully executed jobs (whose results should be
salvaged) and a set of not yet executed jobs that cannot be started, either because they depend on a failed task or because the workflow (engine) is no longer running and thus cannot start them. Once the workflow is restarted, a previously failed task can resume.
The tasks comprising the simulation monitoring workflow represent opera-
tions carried out on individual data items (files) as the simulation produces
data at each timestep. If some operation during a particular timestep fails
(e.g., transfer to another host fails, mass storage is down at archiving time,
or a statistic cannot be created — all common failures outside the control
of the workflow engine), this should not prevent the workflow from invoking
the complete pipeline of operations over the data produced during the next
timestep. However, because a downstream actor may be affected by the re-
sult of an upstream actor, actors should be prepared for such failures. Two
possible solutions are to (1) discard the token corresponding to the failed operation from the token stream, or (2) introduce special “failure tokens” to mark jobs that did not succeed. If we discard the token for the failed operation, downstream actors never receive a bad task request, and therefore no change to the actors is required. However, the absence of tokens changes the balance between the consumption and production rates of the actors, which can lead to difficulties in complex workflow designs, for example when pipelines must be split and merged. If we replace the token with
a failure token, and downstream actors are programmed to simply ignore such
failure tokens, the workflow structure remains simpler. *
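As a rough illustration of approach (2), the Python sketch below replaces the output of a failed operation with a failure token that downstream actors pass along unchanged, so that one output token is still produced per input token. The actor names (transfer, archive) and the FailureToken class are illustrative assumptions, not part of the CPES workflow or of Kepler.

```python
from dataclasses import dataclass

@dataclass
class FailureToken:
    """Marks a job that did not succeed; carries the cause for diagnostics."""
    stage: str
    reason: str

def transfer(path):
    """Hypothetical upstream actor: returns the transferred path, or a
    FailureToken instead of silently dropping the token."""
    try:
        # stand-in for the real transfer call (e.g., copy to another host)
        if path.endswith(".bad"):
            raise IOError("destination host unreachable")
        return path
    except IOError as err:
        return FailureToken("transfer", str(err))

def archive(token):
    """Hypothetical downstream actor: ignores failure tokens, so token
    consumption and production rates stay balanced."""
    if isinstance(token, FailureToken):
        return token                     # pass the failure along unchanged
    return f"archived:{token}"           # stand-in for the archiving step

# One pipeline invocation per file produced at each timestep:
for f in ["ts0001.dat", "ts0002.bad", "ts0003.dat"]:
    print(archive(transfer(f)))
```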
Resuming Workflow Execution Following a Fault. Pipelined (e.g., PN) workflows are harder to restart than DAG workflows because the current
state of the workflow is not as easy to describe and restore; all actors in
the workflow graph may be concurrently executing. While the progress of
executing a conventional DAG workflow (Section 13.2.3) can be seen as a single
“wavefront” progressing from the beginning of the workflow DAG toward the
end, in a pipeline-parallel workflow each task can be invoked repeatedly. If
the workflow system does not support full restoration of the workflow and
actor state (a nearly impossible task when dealing with workflow components
outside the control of the engine), the workflow itself has to include some sort
of lightweight checkpoint and restart capability.
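The sketch below shows one minimal way such a lightweight checkpoint can work: each successful operation is appended to a log, and a restarted run skips anything already logged. The log file name and the helper functions are hypothetical, introduced only to illustrate the idea; they are not the actual CPES mechanism described next.

```python
import os

LOG = "completed_ops.log"        # hypothetical checkpoint file

def load_completed(log=LOG):
    """Read the set of operations already recorded as successful."""
    if not os.path.exists(log):
        return set()
    with open(log) as fh:
        return {line.strip() for line in fh}

def run_once(op_id, action, completed, log=LOG):
    """Run an operation only if it is not already logged; on success,
    append it to the log so a restarted workflow can skip it."""
    if op_id in completed:
        return                           # already done before the fault
    action()
    with open(log, "a") as fh:
        fh.write(op_id + "\n")
    completed.add(op_id)

# Re-running the same pipeline after a crash repeats no completed work.
completed = load_completed()
for step in ["ts0001:transfer", "ts0001:archive", "ts0002:transfer"]:
    run_once(step, lambda s=step: print("executing", s), completed)
```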
In the CPES workflow, the solution is to have the remote execution actor (used for executing all of the actual data processing operations along the pipeline) record all successful operations [26]. When restarted, for example,
* The first design of the CPES workflows was based on approach (1); for the reason above, an improved design employed approach (2). The COMAD model of computation (see Section 13.2.3 and [22]) natively supports mechanisms to tag data, which is an elegant way to achieve variant (2); it can be used to skip over or even bypass data around actors [27], or to perform other forms of exception handling based on tags.