Scientific Process Automation and Workflow Management - Scientific Data Management

possible to use session logs generated by software tools to capture provenance information [61], these logs may not be represented in a format that can be easily queried; and further, they may require sophisticated programming techniques to uncover the information needed to answer the above questions related to data lineage.

Similar to a workflow specification, the provenance of a workflow run is often represented using graph structures in which nodes represent processes and data products, and edges capture the flow of data between processes [60, 62]. Some approaches support additional graph representations, for example, by recording explicit process and data dependencies [63, 64]. A complete description of provenance models and capture mechanisms is beyond the scope of this chapter (see, for example, Simmhan et al. [55] and Davidson and Freire [60]). Instead, we concentrate on describing one particular scheme of storing provenance information that combines features from Kepler [18], VisTrails [65], and Pegasus [53]. We first describe the different types of provenance information considered by these approaches, and then discuss the current implementation of the provenance framework used by the Scientific Data Management (SDM) Center [66].
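
The graph representation described above can be sketched in a few lines. The following is a minimal illustration, not the data model of Kepler, VisTrails, or Pegasus: the class `ProvenanceGraph`, its methods, and the file and step names are all hypothetical. It stores process and data nodes, records data-flow edges, and answers a simple lineage question by walking the edges backward.

```python
class ProvenanceGraph:
    """Toy provenance graph: nodes are processes or data products,
    edges record that data flowed from one node to another."""

    def __init__(self):
        self.kind = {}      # node id -> "process" or "data"
        self.inputs = {}    # node id -> set of nodes it directly depends on

    def add_node(self, node_id, kind):
        if kind not in ("process", "data"):
            raise ValueError(f"unknown node kind: {kind}")
        self.kind[node_id] = kind
        self.inputs.setdefault(node_id, set())

    def add_flow(self, source, target):
        """Record a data-flow edge from source to target."""
        if source not in self.kind or target not in self.kind:
            raise KeyError("both endpoints must be added as nodes first")
        self.inputs[target].add(source)

    def lineage(self, node_id):
        """Return every upstream node that node_id was derived from."""
        seen = set()
        stack = [node_id]
        while stack:
            current = stack.pop()
            for parent in self.inputs.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical two-step run: raw.csv -> clean -> clean.csv -> plot -> figure.png
graph = ProvenanceGraph()
for name, kind in [("raw.csv", "data"), ("clean", "process"),
                   ("clean.csv", "data"), ("plot", "process"),
                   ("figure.png", "data")]:
    graph.add_node(name, kind)
for source, target in [("raw.csv", "clean"), ("clean", "clean.csv"),
                       ("clean.csv", "plot"), ("plot", "figure.png")]:
    graph.add_flow(source, target)

upstream = graph.lineage("figure.png")
data_ancestors = {n for n in upstream if graph.kind[n] == "data"}
```

Because the structure is an explicit graph rather than a free-form session log, the lineage question reduces to a backward traversal instead of log parsing.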

Types of Provenance Information. Provenance information related to scientific workflow systems is sometimes divided into three distinct types, or layers [67]: workflow description, workflow evolution, and workflow execution. The workflow description layer consists of the specifications of individual workflows. The workflow evolution layer captures the relationships among a series of workflow specifications that are created in the course of defining an exploratory analysis. Finally, the workflow execution layer stores runtime information about the execution of a workflow. This information may include, for example, the day and time the workflow was run, the execution time of each workflow step, the data provided to and generated by each step, and a description of the workflow deployment environment. There are many ways to store information in each layer. For example, in VisTrails, a “change-based” model is used to represent both the evolution and workflow description layers [54]; runtime information is captured by the workflow execution engine and stored in a relational database. The three layers are related by the overall provenance storage infrastructure.
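
To make the idea of a change-based model concrete, here is a small sketch in which each version stores only its parent and the single change that produced it, and a workflow specification is recovered by replaying changes from the root. This is an assumption-laden illustration, not the VisTrails data model or API: `Change`, `VersionTree`, the three operation names, and the module names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    op: str                     # "add_module", "delete_module", or "set_param"
    module: str
    key: Optional[str] = None   # parameter name, used only by "set_param"
    value: object = None        # parameter value, used only by "set_param"


class VersionTree:
    """Evolution layer: a tree of versions, each defined by one change."""

    def __init__(self):
        self.parent = {0: None}  # version 0 is the empty workflow
        self.change = {}
        self._next_id = 1

    def commit(self, parent_id, change):
        """Create a new version by applying one change to parent_id."""
        if parent_id not in self.parent:
            raise KeyError(f"unknown version: {parent_id}")
        version_id = self._next_id
        self._next_id += 1
        self.parent[version_id] = parent_id
        self.change[version_id] = change
        return version_id

    def materialize(self, version_id):
        """Description layer: rebuild a specification by replaying changes."""
        path = []
        current = version_id
        while self.parent[current] is not None:
            path.append(self.change[current])
            current = self.parent[current]
        workflow = {}
        for change in reversed(path):
            if change.op == "add_module":
                workflow[change.module] = {}
            elif change.op == "delete_module":
                workflow.pop(change.module, None)
            elif change.op == "set_param":
                workflow[change.module][change.key] = change.value
            else:
                raise ValueError(f"unknown operation: {change.op}")
        return workflow


tree = VersionTree()
v1 = tree.commit(0, Change("add_module", "reader"))
v2 = tree.commit(v1, Change("set_param", "reader", "path", "input.dat"))
v3 = tree.commit(v2, Change("add_module", "isosurface"))
v4 = tree.commit(v2, Change("add_module", "histogram"))  # a second branch

spec_a = tree.materialize(v3)
spec_b = tree.materialize(v4)
```

In this sketch the two branches share the versions up to `v2`, so the common part of the specification is stored once. An execution record would then only need to refer to a version identifier such as `v3`.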

The separation of provenance information into distinct layers can lead to a more normalized representation that avoids storing portions of each layer redundantly. This contrasts, for instance, with provenance approaches that store information about the workflow specification within the execution log, where a module name, the module parameters, and the parameter values are saved for each invocation of a given module. Separating provenance information into distinct layers can also help provenance frameworks become more extensible, for example, by allowing layers to be replaced with new representation approaches or by allowing entirely new layers to be added [90].
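
The normalization argument can be illustrated with a small relational sketch. The schema below is hypothetical (the table names `module`, `parameter`, and `invocation`, and all sample values, are invented for illustration) and is not the schema of any system named in this chapter. Each invocation row refers to a module by identifier instead of copying the module name and parameter values into the execution log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE module (              -- description layer
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE parameter (           -- description layer
    module_id INTEGER NOT NULL REFERENCES module(id),
    name      TEXT NOT NULL,
    value     TEXT NOT NULL
);
CREATE TABLE invocation (          -- execution layer
    id        INTEGER PRIMARY KEY,
    module_id INTEGER NOT NULL REFERENCES module(id),
    started   TEXT NOT NULL,
    seconds   REAL NOT NULL
);
""")

# The module and its parameter are stored once.
conn.execute("INSERT INTO module (id, name) VALUES (1, 'isosurface')")
conn.execute(
    "INSERT INTO parameter (module_id, name, value) VALUES (1, 'threshold', '0.5')"
)

# Each run adds only runtime facts plus a reference to the module.
conn.executemany(
    "INSERT INTO invocation (module_id, started, seconds) VALUES (?, ?, ?)",
    [(1, "2009-01-05T10:00:00", 2.4), (1, "2009-01-05T11:30:00", 2.1)],
)

# Specification details are recovered with a join when they are needed.
rows = conn.execute("""
    SELECT i.started, m.name, p.name, p.value
    FROM invocation AS i
    JOIN module AS m ON m.id = i.module_id
    JOIN parameter AS p ON p.module_id = m.id
    ORDER BY i.started
""").fetchall()

parameter_rows = conn.execute("SELECT COUNT(*) FROM parameter").fetchone()[0]
```

After two invocations the `parameter` table still holds a single row, whereas a log that embedded the specification would have written the module name and parameter value twice.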

The VisTrails workflow evolution approach captures changes to workflow specifications and displays these changes using a history tree called a visualization trail, or vistrail for short [68]. As a workflow developer makes
