Metadata and Provenance Management - Scientific Data Management


approaches, technologies, and implementations. In the recent provenance challenge [50], 16 different systems were used to answer typical provenance queries pertaining to a brain atlas dataset that was produced by a demonstrator workflow in the context of functional magnetic resonance imaging. Inspired by the summary of contributions in Moreau and Ludaescher [50], we present key characteristics of provenance systems. Most provenance systems are embedded inside an execution environment, such as a workflow system or an operating system. In such a context, embedded provenance systems can track all the activities of this execution environment and are capable of providing a description of data produced by such environments. We characterize such systems as integrated environments, since they offer multiple functionalities, including workflow editing, workflow execution, provenance collection, and provenance querying [21, 51, 52]. Integrated environments have some benefits, including usability and seamless integration between the different activities. From a provenance viewpoint, there is close semantic integration between the provenance representation and the workflow model, which allows an efficient representation to be adopted [53]. The downside of integrated systems is that the tight coupling of components rarely allows for their substitution or use in combination with other useful technologies; such systems therefore have difficulties interoperating with others, a requirement of many large-scale scientific applications.

In contrast to integrated provenance environments, approaches such as Provenance-Aware Service-Oriented Architecture (PASOA) [54, 55] and Karma [56] adopt separate, autonomous provenance stores. As execution proceeds, applications produce process documentation that is recorded in a storage system, usually referred to as a provenance store. Such systems give the provenance store an important role, since it offers long-term, persistent, secure storage of process documentation. Provenance of data products can be extracted from provenance stores by issuing queries to them. Over time, provenance stores need to be managed to ensure that process documentation remains accessible and usable in the long term. In particular, PASOA has adopted a provenance model that is independent of the technology used for executing the application. PASOA was demonstrated to operate with multiple workflow technologies, including Pegasus [19], Virtual Data Language (VDL) [57], and Business Process Execution Language (BPEL) [58]. This approach, which favors open data models and open interfaces, allows scientists to adopt the technologies of their choice to run applications. However, a common provenance model would allow for past executions to be described in a coherent manner, even when multiple technologies are involved.
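As a rough illustration of the separate-store idea, the following minimal sketch keeps process documentation in a stand-alone store that any execution technology could write to and that is queried afterwards for the provenance of a data product. The names `ProcessRecord`, `ProvenanceStore`, `record`, and `provenance_of`, as well as the file names and engine labels, are hypothetical and chosen for illustration only; this is not the PASOA or Karma interface, and a real store would add persistence and security.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProcessRecord:
    """One piece of process documentation: an activity with its inputs and outputs."""
    activity: str
    inputs: tuple
    outputs: tuple
    engine: str  # free-form label for the technology that reported the record


@dataclass
class ProvenanceStore:
    """Hypothetical in-memory store; stands in for a persistent provenance store."""
    records: list = field(default_factory=list)

    def record(self, activity, inputs, outputs, engine):
        # Applications call this as execution proceeds.
        self.records.append(
            ProcessRecord(activity, tuple(inputs), tuple(outputs), engine)
        )

    def provenance_of(self, item):
        """Return every record that contributed to `item`, directly or indirectly."""
        found, pending, seen = [], [item], set()
        while pending:
            current = pending.pop()
            if current in seen:
                continue
            seen.add(current)
            for rec in self.records:
                if current in rec.outputs and rec not in found:
                    found.append(rec)
                    pending.extend(rec.inputs)  # walk back through the inputs
        return found


# Two steps reported by two different (made-up) execution technologies.
store = ProvenanceStore()
store.record("align", ["scan.img"], ["aligned.img"], engine="engine-A")
store.record("average", ["aligned.img"], ["atlas.img"], engine="engine-B")

lineage = store.provenance_of("atlas.img")
activities = [rec.activity for rec in lineage]
```

Because the records carry no engine-specific structure beyond a label, the same query covers steps reported by both engines, which is the property a technology-independent model is meant to provide.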

All provenance systems rely on some form of database management system to store their data; RDF and SQL stores were the preferred technologies. Associated query languages are used to express provenance queries, but some systems use query templates and query interfaces that are specifically provenance oriented, helping users express their provenance questions precisely and easily without having to understand the underlying schemas adopted by the implementations.
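To make the idea of a query template concrete, here is a minimal sketch using an SQL store (SQLite through Python's standard `sqlite3` module). The table `derivation`, its columns, the template `LINEAGE_TEMPLATE`, and the helper `lineage_of` are assumptions made for this example and do not reflect the schema of any particular provenance system. The user supplies only the data item of interest; the recursive query and the schema stay hidden behind the template.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: one row per (product, activity, source) derivation step.
conn.execute("CREATE TABLE derivation (product TEXT, activity TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO derivation VALUES (?, ?, ?)",
    [
        ("aligned.img", "align", "scan.img"),
        ("atlas.img", "average", "aligned.img"),
    ],
)

# Query template: walks the derivation chain backwards from :item.
# Assumes the derivation graph is acyclic, otherwise the recursion would not end.
LINEAGE_TEMPLATE = """
WITH RECURSIVE lineage(depth, activity, source, product) AS (
    SELECT 1, activity, source, product
    FROM derivation
    WHERE product = :item
    UNION
    SELECT l.depth + 1, d.activity, d.source, d.product
    FROM derivation AS d
    JOIN lineage AS l ON d.product = l.source
)
SELECT depth, activity, source, product FROM lineage ORDER BY depth
"""


def lineage_of(item):
    """Fill the template with a data item and return its derivation steps."""
    return conn.execute(LINEAGE_TEMPLATE, {"item": item}).fetchall()


rows = lineage_of("atlas.img")
for depth, activity, source, product in rows:
    print(f"{depth}: {product} <- {activity}({source})")
```

A comparable template could be written in SPARQL for an RDF store; SQL is used here only to keep the sketch self-contained.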
