Scientific Process Automation and Workflow Management - Scientific Data Management


determine which methods and components are most suitable for the particular datasets under investigation. Such exploratory workflow design is common when developing new analysis methods. Conversely, some applications require the development of production workflows to be executed on a regular basis with new datasets or simulation parameters (e.g., environmental monitoring and analysis workflows or the fusion simulation workflow in Section 13.3).

Another important distinction has to do with what the workflow components (called actors or tasks) represent and model. In science-oriented workflows, actors model a scientific method or process. In such workflows, individual workflow steps are generally meaningful to the scientist; that is, they correspond more or less directly to high-level steps of the scientific method being automated. Contrasting with science-oriented workflows are resource-oriented workflows. Actors and workflow steps in the latter model data- and resource-handling tasks rather than the science. In such cases, the actual analytical or simulation operations might be “hidden” from the workflow system, and instead the workflow directly handles the “plumbing” tasks such as data movement, data replication, and job management (submit, pause, resume, abort, etc.). The simulation management workflow in Section 13.3 is an example of such a resource-oriented “plumbing workflow.”
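As a rough illustration of a resource-oriented actor, the sketch below models only the job-management commands named above (submit, pause, resume, abort). The class name `JobManagerActor`, the `JobState` values, the transition table, and the example command line are all hypothetical and are not taken from Kepler or any other workflow system. The point is that the actor tracks a job's lifecycle while the simulation itself stays opaque to it.

```python
from enum import Enum


class JobState(Enum):
    SUBMITTED = "submitted"
    PAUSED = "paused"
    ABORTED = "aborted"


# Hypothetical transition table: command -> {state before: state after}.
TRANSITIONS = {
    "pause": {JobState.SUBMITTED: JobState.PAUSED},
    "resume": {JobState.PAUSED: JobState.SUBMITTED},
    "abort": {
        JobState.SUBMITTED: JobState.ABORTED,
        JobState.PAUSED: JobState.ABORTED,
    },
}


class JobManagerActor:
    """Plumbing actor: tracks a job's lifecycle and never inspects the science."""

    def __init__(self, command_line):
        # The command line is opaque to the workflow; constructing the actor
        # stands in for the "submit" step in this sketch.
        self.command_line = command_line
        self.state = JobState.SUBMITTED

    def handle(self, command):
        """Apply a lifecycle command, rejecting transitions that make no sense."""
        allowed = TRANSITIONS.get(command, {})
        if self.state not in allowed:
            raise ValueError(f"cannot {command} a job that is {self.state.value}")
        self.state = allowed[self.state]
        return self.state


job = JobManagerActor("run_simulation --steps 1000")  # hypothetical command
for command in ("pause", "resume", "abort"):
    print(command, "->", job.handle(command).value)
```

A science-oriented actor would instead expose the analysis step itself (for example, a named alignment or fitting method) as the unit the scientist sees and configures.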

13.2.3 Models of Computation

Consider a workflow graph W consisting of actors (tasks, workflow steps) and connections (directed edges) between them.* With W we can associate a set of parameters p, input datasets x, and output datasets y. A model of computation (MoC) M prescribes how to execute the parameterized workflow $W_p$ on x to obtain y. Therefore, we can view a MoC as a mapping $M : \mathcal{W} \times P \times X \to Y$, which for any workflow $W \in \mathcal{W}$, parameter settings $p \in P$, and inputs $x \in X$ uniquely determines the workflow outputs $y \in Y$. We denote this by $y = M(W_p(x))$. While most current scientific workflow systems employ a single MoC, the Kepler system [18], due to its heritage from Ptolemy [16], supports more than one such MoC: for each MoC M, there is a corresponding director of the same name which implements M.

For example, consider the PN (process network) model of computation. Using the PN director in Kepler, a workflow W executes as a dataflow process network [19, 20]. In PN, each actor executes as a separate, data-driven process (or thread) which is continuously running. Actor connections in PN correspond to unidirectional channels (modeled as unbounded queues) over which ordered token streams are sent, and actors in PN block (wait) only when there are not enough tokens available on the actor's input ports. Process networks naturally support pipeline parallelism as well as task and data parallelism.
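The PN semantics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Kepler's PN director: the function names (`source`, `transform`, `sink`, `run_pn`), the `scale` parameter, and the `END` marker are all hypothetical. Each actor runs in its own thread, each connection is an unbounded FIFO queue, and an actor blocks only when its input channel is empty. Real PN actors run continuously; the `END` token is added here only so that the demo terminates.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker so the demo terminates


def source(values, out):
    """Emit each input token in order, then signal end of stream."""
    for v in values:
        out.put(v)
    out.put(END)


def transform(fn, inp, out):
    """Apply fn to each token as it arrives on the input channel."""
    while True:
        token = inp.get()  # blocks only while no token is available
        if token is END:
            out.put(END)
            return
        out.put(fn(token))


def sink(inp, results):
    """Collect output tokens in arrival order."""
    while True:
        token = inp.get()
        if token is END:
            return
        results.append(token)


def run_pn(params, inputs):
    """Compute y = M(W_p(x)) for a fixed three-actor pipeline W, PN style."""
    a, b = queue.Queue(), queue.Queue()  # unbounded, unidirectional channels
    results = []
    scale = params["scale"]  # hypothetical workflow parameter p
    threads = [
        threading.Thread(target=source, args=(inputs, a)),
        threading.Thread(target=transform, args=(lambda tok: tok * scale, a, b)),
        threading.Thread(target=sink, args=(b, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pn({"scale": 10}, [1, 2, 3]))  # [10, 20, 30]
```

Because all three actors run concurrently, the transform actor can process the first token while the source is still emitting later ones, which is the pipeline parallelism mentioned above. Each channel has a single producer and a single consumer, so token order is preserved.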

* Here we ignore a number of details: actor ports, subworkflows “hidden” within so-called composite actors, and so forth.
