Uncertainty in Data Integration and Dataspace Support Platforms - Schema Matching and Mapping

In this chapter, we argue that as the scope of data integration applications
broadens, such systems need to be able to model uncertainty at their core.
Uncertainty can arise for multiple reasons in data integration. First, the
semantic mappings between the data sources and the mediated schema may be
approximate. For example, in an application like Google Base (GoogleBase 2005)
that enables anyone to upload structured data, or when mapping millions of
sources on the deep Web (Madhavan et al. 2007), we cannot imagine specifying
exact mappings. In some domains (e.g., bioinformatics), we do not necessarily
know what the exact mapping is. Second, data are often extracted from
unstructured sources using information extraction techniques. Since these
techniques are approximate, the data obtained from the sources may be
uncertain. Third, if the intended users of the application are not necessarily
familiar with schemata, or if the domain of the system is too broad to offer
form-based query interfaces (such as Web forms), we need to support keyword
queries. Hence, another source of uncertainty is the transformation between
keyword queries and a set of candidate structured queries. Finally, if the
scope of the domain is very broad, there can even be uncertainty about the
concepts in the mediated schema.

Another reason for data integration systems to model uncertainty is to support
pay-as-you-go integration. Dataspace Support Platforms (Halevy et al. 2006a)
envision data integration systems where sources are added with no effort and
the system is constantly evolving in a pay-as-you-go fashion to improve the
quality of semantic mappings and query answering. This means that as the system
evolves, there will be uncertainty about the semantic mappings to its sources,
its mediated schema, and even the semantics of the queries posed to it.

This chapter describes some of the formal foundations for data integration with
uncertainty. We define probabilistic schema mappings and probabilistic mediated
schemas and show how to answer queries in their presence. With these
foundations, we show that it is possible to bootstrap a pay-as-you-go
integration system completely automatically.

This chapter is largely based on previous papers (Dong et al. 2007; Sarma et al.
2008). The proofs of the theorems we state and the experimental results
validating some of our claims can be found therein. We also place several other
works on uncertainty in data integration in the context of the system we
envision. In the next section, we describe an architecture for a data
integration system that incorporates uncertainty.

2 Overview of the System

This section describes the requirements for a data integration system that
supports uncertainty and the overall architecture of the system.
