In this chapter, we argue that as the scope of data integration applications broadens,
such systems need to be able to model uncertainty at their core. Uncertainty can
arise for multiple reasons in data integration. First, the semantic mappings between
the data sources and the mediated schema may be approximate. For example, in
an application like Google Base (GoogleBase 2005) that enables anyone to upload
structured data, or when mapping millions of sources on the deep Web (Madhavan
et al. 2007), we cannot imagine specifying exact mappings. In some domains (e.g.,
bioinformatics), we do not necessarily know what the exact mapping is. Second,
data are often extracted from unstructured sources using information extraction
techniques. Since these techniques are approximate, the data obtained from the
sources may be uncertain. Third, if the intended users of the application are not
necessarily familiar with schemata, or if the domain of the system is too broad to
offer form-based query interfaces (such as Web forms), we need to support keyword
queries. Hence, another source of uncertainty is the transformation between
keyword queries and a set of candidate structured queries. Finally, if the scope of
the domain is very broad, there can even be uncertainty about the concepts in the
mediated schema.
Another reason for data integration systems to model uncertainty is to support
pay-as-you-go integration. Dataspace Support Platforms (Halevy et al. 2006a) envision
data integration systems where sources are added with no effort and the system
is constantly evolving in a pay-as-you-go fashion to improve the quality of semantic
mappings and query answering. This means that as the system evolves, there will
be uncertainty about the semantic mappings to its sources, its mediated schema, and
even the semantics of the queries posed to it.
This chapter describes some of the formal foundations for data integration with
uncertainty. We define probabilistic schema mappings and probabilistic mediated
schemas and show how to answer queries in their presence. With these foundations,
we show that it is possible to bootstrap a pay-as-you-go
integration system fully automatically.
This chapter is largely based on previous papers (Dong et al. 2007; Sarma et al.
2008). The proofs of the theorems we state and the experimental results validating
some of our claims can be found therein. We also place several other works on
uncertainty in data integration in the context of the system we envision. In the next
section, we describe an architecture for a data integration system that incorporates
uncertainty.
2 Overview of the System
This section describes the requirements for a data integration system that supports
uncertainty and the overall architecture of the system.