complete simulation output (as a single file) can be sent to the archive system.
Thus, the automated solution groups files into appropriately sized chunks while
taking care of other requirements, for example, ensuring that all data for one
timestep goes into the same chunk.* Finally, recording the data provenance of
all generated data becomes increasingly important as the size and complexity
of the output grows. For example, from an automatically generated diagnostic
image, a scientist must be able to easily find the output of the simulation
corresponding to the visualization. Tools can greatly help with transferring
the relevant data to the scientist's host machine (which could be at a remote
site) provided that the above simulation management workflow records the
necessary data lineage of all operations.
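To make the chunking requirement concrete, here is a minimal Python sketch; the file-naming pattern ("name.<step>.<ext>"), the 4 GiB limit, and the JSON lineage log are assumptions of this illustration, not part of the actual workflow. It groups output files into size-bounded chunks without ever splitting a timestep, and appends one provenance record per chunk:

import json
import os
from collections import defaultdict

CHUNK_LIMIT = 4 * 2**30   # assumed 4 GiB target chunk size

def chunk_by_timestep(files, limit=CHUNK_LIMIT):
    # Group files by timestep; the "name.<step>.<ext>" naming pattern
    # is an assumption of this sketch.
    by_step = defaultdict(list)
    for path in files:
        by_step[int(path.rsplit(".", 2)[-2])].append(path)
    chunks, current, size = [], [], 0
    for step in sorted(by_step):
        group = by_step[step]
        group_size = sum(os.path.getsize(p) for p in group)
        # Never split a timestep: open a new chunk if this one is full.
        if current and size + group_size > limit:
            chunks.append(current)
            current, size = [], 0
        current += group
        size += group_size
    if current:
        chunks.append(current)
    return chunks

def record_lineage(log_path, chunk_id, inputs):
    # One JSON line per archived chunk links it back to its input files.
    with open(log_path, "a") as log:
        log.write(json.dumps({"chunk": chunk_id, "inputs": inputs}) + "\n")

A real workflow would record richer lineage (job identifiers, parameters, timestamps), but even a one-line-per-chunk log suffices to trace a diagnostic image back to the simulation output it came from.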
Pipeline Parallel Processing. An important feature of the Kepler environment
is its support for the dataflow process network model of computation [19, 20],
implemented via the Process Network (PN) director [16]. Under the PN director,
all actors run continuously in separate threads, processing input as soon as
it arrives. Each pipeline in the above workflow
is therefore processing a stream of data items in pipeline-parallel mode. For
example, since XGC1 outputs diagnostic data into three NetCDF files at each
timestep, plots can be created for one file while a second is being merged
and a third is being transferred. In a typical production
run scenario, XGC1 outputs a new timestep every 30 seconds. The time to get
one file through the processing pipeline includes the time for recognizing its
presence, the transfer time, and the execution time of the plot generation job
on the processing cluster. If the workflow performed only one of these steps at
a time (e.g., as prescribed by the SDF director), the simulation would generate
files faster than they could be processed. Due to the size of the 3D data in
the HDF5 pipeline and the longer transfer time of those files, the situation is
similar in this pipeline as well. Finally, the archiving process must work
in parallel with the rest of the workflow, since archiving is itself slow.
If the task and pipeline parallelism exhibited by the above workflow
is not enough to keep up with the flow of data, one can replicate individual
actors on different compute nodes to process multiple data items at the
same time. Although the above workflow does not currently need to do this, a
more complex production workflow is in use for coupling other codes (such as
those described in Section 5) with XGC0 [25], the predecessor of XGC1; there,
a parameter study has to be executed for each timestep of the simulation, and
that study is executed in this parallel mode.
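Kepler itself is Java-based, but the pipeline-parallel pattern is easy to illustrate with a short Python sketch; the three stage functions below are placeholders, not the workflow's real actors. Each stage runs in its own thread and consumes from a queue, so a new file can be in transfer while earlier ones are being plotted and archived:

import queue
import threading

def transfer_file(path):   # placeholder for the actual file transfer
    return path

def make_plots(path):      # placeholder for the plot-generation job
    return path

def archive_file(path):    # placeholder for the archiving step
    return None

def stage(inbox, outbox, work):
    # One pipeline stage: take items as they arrive, process each,
    # and pass the result downstream; None marks end of stream.
    while True:
        item = inbox.get()
        if item is None:
            if outbox is not None:
                outbox.put(None)
            return
        result = work(item)
        if outbox is not None:
            outbox.put(result)

q_in, q_plot, q_arch = queue.Queue(), queue.Queue(), queue.Queue()
threads = [threading.Thread(target=stage, args=a)
           for a in ((q_in, q_plot, transfer_file),
                     (q_plot, q_arch, make_plots),
                     (q_arch, None, archive_file))]
for t in threads:
    t.start()
for ts in range(3):                  # feed a few timestep files
    q_in.put("diag.%05d.nc" % ts)
q_in.put(None)                       # signal end of stream
for t in threads:
    t.join()

With a new file arriving every 30 seconds, such a pipeline keeps up as long as each individual stage takes less than 30 seconds per file, even when the end-to-end latency of a single file is much longer.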
Robustness of Workflows. There are two different but related aspects of
robustness in compute-intensive workflows: what happens if the overall
workflow execution fails and stops (e.g., at the workflow engine level), and
what happens if an individual task in the workflow fails? For
* An additional problem arises when data is generated faster than it can be
archived. In this case, an extra workflow step can be inserted that queues the
data on an auxiliary disk, decoupling the slow archival from the fast data
generation.
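A minimal Python sketch of that decoupling step (the spool directory and the archive_one callback are assumptions of this illustration): the producer moves finished files to an auxiliary disk immediately, and a separate consumer drains them at whatever pace the archive system sustains.

import os
import shutil

SPOOL = "/scratch/archive_spool"   # hypothetical auxiliary-disk location

def spool(path):
    # Producer side: move a finished file onto the auxiliary disk at
    # once, so the simulation side never waits on the archive system.
    os.makedirs(SPOOL, exist_ok=True)
    shutil.move(path, SPOOL)

def drain(archive_one):
    # Consumer side: archive spooled files oldest-first; archive_one
    # (e.g., a mass-storage put) is assumed to be supplied by the caller.
    paths = sorted((os.path.join(SPOOL, f) for f in os.listdir(SPOOL)),
                   key=os.path.getmtime)
    for p in paths:
        archive_one(p)
        os.remove(p)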