possible to use session logs generated by software tools to capture provenance
information, 61 these logs may not be represented in a format that can be easily
queried; and further, they may require sophisticated programming techniques
to uncover the information needed to answer the above questions related to
data lineage.
Similar to a workflow specification, the provenance of a workflow run is of-
ten represented using graph structures in which nodes represent processes and
data products, and edges capture the flow of data between processes. 60,62 Some
approaches support additional graph representations, for example, by recording
explicit process and data dependencies. 63,64 A complete description of
provenance models and capture mechanisms is beyond the scope of this chap-
ter (see, for example, Simmhan et al. 55 and Davidson and Freire 60 ). Instead, we
concentrate on describing one particular scheme of storing provenance infor-
mation that combines features from Kepler, 18 VisTrails, 65 and Pegasus. 53 We
first describe the different types of provenance information considered by these
approaches, and then discuss the current implementation of the provenance
framework used by the Scientific Data Management (SDM) Center. 66
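To make the graph representation concrete, the following is a minimal sketch of a provenance graph in which nodes are processes and data products and edges record the flow of data. It is not taken from Kepler, VisTrails, or Pegasus; the class name ProvenanceGraph, its methods, and the file and step names are hypothetical and chosen only for illustration. A lineage question ("what contributed to this data product?") becomes an upstream traversal.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Hypothetical graph of process and data nodes for one workflow run."""

    def __init__(self):
        self.kind = {}                    # node id -> "process" or "data"
        self.parents = defaultdict(set)   # node id -> nodes it directly depends on

    def add_process(self, proc_id, inputs, outputs):
        """Record one process invocation with the data it read and wrote."""
        self.kind[proc_id] = "process"
        for data_id in inputs:
            self.kind.setdefault(data_id, "data")
            self.parents[proc_id].add(data_id)    # edge: data -> process
        for data_id in outputs:
            self.kind.setdefault(data_id, "data")
            self.parents[data_id].add(proc_id)    # edge: process -> data

    def lineage(self, node_id):
        """Return every upstream process and data product of node_id."""
        seen = set()
        stack = [node_id]
        while stack:
            current = stack.pop()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Made-up three-step run used only to exercise the sketch.
g = ProvenanceGraph()
g.add_process("align", inputs=["raw.fasta"], outputs=["aligned.aln"])
g.add_process("build_tree", inputs=["aligned.aln", "params.cfg"], outputs=["tree.nwk"])
g.add_process("plot", inputs=["tree.nwk"], outputs=["tree.png"])
upstream = g.lineage("tree.png")
```

Here upstream contains all three processes and the four data products that precede tree.png, which is the kind of answer that is hard to extract from an unstructured session log.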
Types of Provenance Information. Provenance information related to
scientific workflow systems is sometimes divided into three distinct types, or
layers 67: workflow description, workflow evolution, and workflow execution.
The workflow description layer consists of the specifications of individual
workflows. The workflow evolution layer captures the relationships among
a series of workflow specifications that are created in the course of defining
an exploratory analysis. Finally, the execution layer stores runtime informa-
tion about the execution of a workflow. This information may include, for
example, the day and time the workflow was run, the execution time of each
workflow step, the data provided to and generated by each step, a description
of the workflow deployment environment, and so on. There are many ways to
store information in each layer. For example, in VisTrails, a “change-based”
model is used to represent both the workflow description and workflow evolution
layers, 54 while runtime information is captured by the workflow execution engine
and stored in a relational database. The three layers are related by the overall provenance storage
infrastructure.
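One way to picture a change-based model is as a tree of versions in which each version stores only the change made to its parent, and a full workflow specification is recovered by replaying the changes from the root. The sketch below is a simplified illustration under that assumption, not the actual VisTrails schema or API; the class VersionTree, the action names, and the module names are hypothetical.

```python
class VersionTree:
    """Hypothetical change-based store: each version keeps one action."""

    def __init__(self):
        self.parent = {0: None}   # version 0 is the empty workflow (root)
        self.action = {0: None}
        self._next_id = 1

    def commit(self, parent_id, action):
        """Record a change applied to parent_id and return the new version id."""
        version_id = self._next_id
        self._next_id += 1
        self.parent[version_id] = parent_id
        self.action[version_id] = action
        return version_id

    def materialize(self, version_id):
        """Rebuild a workflow by replaying actions from the root to version_id."""
        chain = []
        v = version_id
        while v is not None and self.action[v] is not None:
            chain.append(self.action[v])
            v = self.parent[v]
        modules, connections = {}, set()
        for op, *args in reversed(chain):
            if op == "add_module":
                name, params = args
                modules[name] = dict(params)
            elif op == "set_param":
                name, key, value = args
                modules[name][key] = value
            elif op == "add_connection":
                src, dst = args
                connections.add((src, dst))
        return modules, connections


tree = VersionTree()
v1 = tree.commit(0, ("add_module", "reader", {"path": "data.vtk"}))
v2 = tree.commit(v1, ("add_module", "isosurface", {"level": 0.5}))
v3 = tree.commit(v2, ("add_connection", "reader", "isosurface"))
v4 = tree.commit(v3, ("set_param", "isosurface", "level", 0.8))   # one branch
v5 = tree.commit(v3, ("add_module", "renderer", {}))               # sibling branch
modules_v4, connections_v4 = tree.materialize(v4)
modules_v5, connections_v5 = tree.materialize(v5)
```

Because v4 and v5 share the prefix up to v3, the tree holds both the specifications (description layer) and the relationships among them (evolution layer) without storing any workflow twice.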
The separation of provenance information into distinct layers can lead to
a more normalized representation that avoids storing portions of each layer
redundantly. This contrasts with provenance approaches that
store information about the workflow specification within the execution log,
where a module name, the module parameters, and the parameter values are
saved for each invocation of a given module. Separating provenance informa-
tion into distinct layers can also help provenance frameworks become more
extensible, for example, by allowing layers to be replaced with new represen-
tation approaches or by allowing entirely new layers to be added. 90
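The normalization argument can be sketched with a small relational schema. The example below uses Python's standard sqlite3 module with an in-memory database; the table and column names, the workflow names, and the timestamps are all hypothetical and are not the schema of any system discussed here. The execution tables refer to modules by key, so module names and parameter values are stored once rather than once per invocation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- description layer (parent_id links versions: a simple evolution layer)
CREATE TABLE workflow (
    id INTEGER PRIMARY KEY,
    name TEXT,
    parent_id INTEGER REFERENCES workflow(id)
);
CREATE TABLE module (
    id INTEGER PRIMARY KEY,
    workflow_id INTEGER REFERENCES workflow(id),
    name TEXT
);
CREATE TABLE parameter (
    module_id INTEGER REFERENCES module(id),
    param_name TEXT,
    param_value TEXT
);
-- execution layer: references the description layer instead of copying it
CREATE TABLE run (
    id INTEGER PRIMARY KEY,
    workflow_id INTEGER REFERENCES workflow(id),
    started_at TEXT
);
CREATE TABLE module_run (
    run_id INTEGER REFERENCES run(id),
    module_id INTEGER REFERENCES module(id),
    seconds REAL
);
""")

# Made-up data: two versions of a workflow, one module, two runs.
conn.execute("INSERT INTO workflow VALUES (1, 'isosurface-v1', NULL)")
conn.execute("INSERT INTO workflow VALUES (2, 'isosurface-v2', 1)")
conn.execute("INSERT INTO module VALUES (10, 2, 'isosurface')")
conn.execute("INSERT INTO parameter VALUES (10, 'level', '0.8')")
for run_id, started, secs in [(100, "2024-01-05T10:00", 1.5),
                              (101, "2024-01-06T09:30", 1.2)]:
    conn.execute("INSERT INTO run VALUES (?, 2, ?)", (run_id, started))
    conn.execute("INSERT INTO module_run VALUES (?, 10, ?)", (run_id, secs))

# Recover name and parameters for each invocation with a join.
rows = conn.execute("""
    SELECT r.id, m.name, p.param_name, p.param_value, mr.seconds
    FROM module_run AS mr
    JOIN run AS r ON r.id = mr.run_id
    JOIN module AS m ON m.id = mr.module_id
    JOIN parameter AS p ON p.module_id = m.id
    ORDER BY r.id
""").fetchall()
param_rows = conn.execute("SELECT COUNT(*) FROM parameter").fetchone()[0]
```

After two runs the parameter table still holds a single row, while the join reconstructs the per-invocation view that a denormalized execution log would have stored directly. Adding a further layer under this design amounts to adding tables that reference the existing keys.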
The VisTrails workflow evolution approach captures changes to work-
flow specifications and displays these changes using a history tree called a
visualization trail, or vistrail for short. 68
As a workflow developer makes