Metadata and Provenance Management - Scientific Data Management

is not required by end users, since their needs tend to be limited to specific tasks, such as experiment reproducibility or validation of an analysis.

To support the vision of provenance of electronic data, we distinguish between process documentation, a representation of past processes as they occur inside computer systems, and provenance queries, which extract relevant information from process documentation to support users' needs.

Process documentation is collected during the execution of processes or workflows and begins to accumulate well before data are produced, or even before it is known that some dataset is to be produced. Hence, managing such process documentation differs from managing metadata. In practice, in a given application context, users may identify commonly asked provenance queries, which can be precomputed, with the results stored and made available.
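
The distinction between process documentation and provenance queries can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record schema, the function names (`record_step`, `provenance_query`), and the file names are hypothetical and do not come from any particular provenance system. Steps are documented as they run, a query later walks that documentation backward from a data item, and a commonly asked query is precomputed and stored.

```python
# Hypothetical process documentation: one record per executed step,
# appended while the process runs, before any provenance question is asked.
process_documentation = []

def record_step(step, inputs, outputs):
    """Append a description of one executed step (illustrative schema)."""
    process_documentation.append(
        {"step": step, "inputs": tuple(inputs), "outputs": tuple(outputs)}
    )

def provenance_query(item):
    """Return the steps that contributed to `item`, upstream steps first."""
    # Index the documentation: data item -> record that produced it.
    producer = {out: rec for rec in process_documentation for out in rec["outputs"]}
    seen, ordered = set(), []

    def visit(data):
        rec = producer.get(data)
        if rec is None or rec["step"] in seen:
            return  # an original input, or a step already collected
        seen.add(rec["step"])
        for parent in rec["inputs"]:
            visit(parent)
        ordered.append(rec["step"])  # appended after all of its ancestors

    visit(item)
    return ordered

# Documentation accumulates during execution.
record_step("calibrate", ["raw.dat"], ["calibrated.dat"])
record_step("filter", ["calibrated.dat", "mask.dat"], ["filtered.dat"])
record_step("plot", ["filtered.dat"], ["plot.png"])
record_step("unrelated", ["other.dat"], ["other.out"])

# A commonly asked query is precomputed and its result stored.
precomputed = {"plot.png": provenance_query("plot.png")}
```

Here the stored answer for `plot.png` lists only the three steps that led to it, while the unrelated step stays in the process documentation without appearing in the query result.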

Similar to the earlier discussion of different metadata layers, we can think of provenance as consisting of descriptions at different levels of abstraction, essentially aimed at different audiences: to support scientific reproducibility, engineering reproducibility, or even deeper understanding of the process that created the derived data (we provide an example of the latter in the context of scientific workflows below). For scientific reproducibility, where scientists want to share and verify their findings with colleagues inside or outside their collaboration, the user may need to know which datasets were used, what type of analysis was performed, and with what parameters. However, in cases where the results need to be reproduced bit by bit, more detailed information is needed, such as the hardware architecture of the resource, the environment variables used, and the library versions. Finally, provenance can also be used to analyze the performance of the analyses,[17] where the provenance records are mined to determine the number of tasks executed, their runtime distribution, where the execution took place, and so forth.
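
As a rough illustration of mining provenance records for performance analysis, the sketch below summarizes a handful of made-up records. The field names (`task`, `host`, `start`, `end`) and the values are assumptions chosen for the example, not the format of any specific system.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical provenance records; times are seconds since workflow start.
records = [
    {"task": "align", "host": "node01", "start": 0.0, "end": 12.5},
    {"task": "align", "host": "node02", "start": 1.0, "end": 11.0},
    {"task": "merge", "host": "node01", "start": 13.0, "end": 15.0},
    {"task": "plot", "host": "node03", "start": 15.5, "end": 16.0},
]

def summarize(records):
    """Derive task count, execution sites, and simple runtime statistics."""
    runtimes = [r["end"] - r["start"] for r in records]
    return {
        "task_count": len(records),
        "tasks_per_host": dict(Counter(r["host"] for r in records)),
        "mean_runtime": mean(runtimes),
        "median_runtime": median(runtimes),
        "max_runtime": max(runtimes),
    }

summary = summarize(records)
```

The same records that support reproducibility questions are reused here for a different audience: the summary reports how many tasks ran, where they ran, and how their runtimes are distributed.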

In some cases, scientific processes are managed by workflow management systems. These may take in an abstract workflow description and generate an executable workflow. During the mapping, the workflow system may modify the executable workflow to the point that it is no longer easy to map between what has been executed and what the user specified.[18] As a result, information about the workflow restructuring process needs to be recorded as well.[19] This information not only allows us to relate the user-created workflow to the executable one but is also the foundation for workflow debugging, where users can trace how the specification they provided evolved into an executable workflow.
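
One way to picture the recording of workflow restructuring is the toy mapper below. It is only a sketch under stated assumptions: the clustering rule, the log schema, and the names `map_workflow` and `trace_to_specification` are hypothetical and do not describe how any real workflow system performs its mapping. The point is that each executable job carries a record of the abstract tasks it was derived from.

```python
def map_workflow(abstract_tasks, cluster_size=2):
    """Toy mapper: group consecutive abstract tasks into executable jobs
    and log every restructuring decision made along the way."""
    executable_jobs = []
    restructuring_log = []
    for i in range(0, len(abstract_tasks), cluster_size):
        group = abstract_tasks[i:i + cluster_size]
        job_id = f"job{len(executable_jobs)}"
        executable_jobs.append(job_id)
        restructuring_log.append({
            "executable": job_id,
            "derived_from": list(group),
            # "cluster" when several tasks were merged, "copy" otherwise
            "action": "cluster" if len(group) > 1 else "copy",
        })
    return executable_jobs, restructuring_log

def trace_to_specification(job_id, log):
    """Return the user-specified tasks behind an executable job."""
    for entry in log:
        if entry["executable"] == job_id:
            return entry["derived_from"]
    return []

abstract = ["extract", "transform", "load"]  # what the user specified
jobs, log = map_workflow(abstract)           # what will actually run
origin = trace_to_specification("job0", log)
```

With the log in hand, a failure in `job0` can be traced back to the `extract` and `transform` tasks the user wrote, which is the kind of debugging link the restructuring record is meant to provide.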

In the area of workflow management and provenance, an interesting aspect of workflow creation is the ability to retrace how a particular workflow has been designed, or in other words, to determine the provenance of the workflow creation process. A particularly interesting approach is taken in VisTrails,[20, 21] where the user is presented with a graphical interface for workflow creation and the system incrementally saves the state of the workflow as it is being
