The primary task of a scientific workflow system is to automate the ex-
ecution of scientific workflows. Scientific workflow systems may additionally
support users in the design, composition, and verification of scientific work-
flows. They also may include support for monitoring the execution of work-
flows in real time; recording the processing history of data; planning resource
allocation in distributed execution environments; discovering existing work-
flows and workflow components; recording the lineage of data and evolution of
workflows; and generally managing scientific data. Thus, a scientific workflow
system primarily serves as a workflow execution engine , but may also include
features of problem-solving environments (PSE). 10
Wainer et al. describe some of the differences between business (or “office
automation”) workflows and scientific workflows, stating, “whereas office
work is about goals, scientific work is about data”. 11 Business workflows are
mainly concerned with the modeling of business rules, policies, and case man-
agement, and therefore are often control- and activity-oriented. In contrast, to
support the work of computational scientists, scientific workflows are mainly
concerned with capturing scientific data analysis or simulation processes and
the associated management of data and computational resources. While scien-
tific workflow technology and research can inherit and adopt techniques from
the field of business workflows, there are several, sometimes subtle, differences
ranging from the modeling paradigms used to the underlying computation
models employed to execute workflows. 86 For example, scientific workflows
are usually dataflow-oriented “analysis pipelines” that often exhibit pipeline
parallelism over data streams in addition to supporting the data parallelism
and task parallelism common in business workflows. * In some cases (for ex-
ample, in seismic or geospatial data processing 12 ), scientific workflows execute
as digital signal processing (DSP) pipelines. In contrast, traditional workflows
often deal with case management (for example, insurance claims, mortgage
applications), tend to be more control-intensive, and lend themselves to very
different models of computation.
In Section 13.2 we introduce basic concepts and describe key characteris-
tics of scientific workflows. In Section 13.3 we provide a detailed case study
from a fusion simulation project where scientific workflows are used to man-
age complex scientific simulations. Section 13.4 describes scientific workflow
systems currently in use and in development. Section 13.5 introduces and dis-
cusses basic notions of data and workflow provenance in the scientific workflow
context, and describes how workflow systems monitor execution and manage
provenance. Finally, Section 13.6 describes approaches for enabling workflow
reuse, sharing, and collaboration.
* In the parallel computing literature, task parallelism refers to distributing tasks (processes)
across different parallel computing nodes, and data parallelism involves distributing data across
multiple nodes. Pipeline parallelism is a more specific condition that arises whenever multiple
processes arranged in a linear sequence execute simultaneously.
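The three forms of parallelism distinguished above can be illustrated with a minimal sketch of pipeline parallelism. The code below is an assumption for illustration only (the stage functions and queue-based wiring are invented, not drawn from any workflow system discussed in this chapter): three stages arranged in a linear sequence each run in their own thread, so later items enter stage 1 while earlier items are still being processed by stages 2 and 3, mirroring how a dataflow-oriented analysis pipeline streams data through its steps.

```python
import queue
import threading

SENTINEL = None  # marks the end of the data stream


def stage(fn, inbox, outbox):
    """Read items from inbox, apply fn, and forward results downstream."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate end-of-stream to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(items, *fns):
    """Wire the stage functions together with queues and stream items through."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)  # all stages run concurrently on the stream
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# A toy three-stage "analysis pipeline" over a stream of five items.
print(run_pipeline(range(5),
                   lambda x: x + 1,   # stage 1: preprocess
                   lambda x: x * x,   # stage 2: transform
                   lambda x: x - 1))  # stage 3: postprocess
# → [0, 3, 8, 15, 24]
```

Data parallelism, by contrast, would replicate a single stage across nodes and partition the input among the copies, while task parallelism would run independent stages of a workflow graph on different nodes.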