2.1 Uncertainty in Data Integration
A data integration system needs to handle uncertainty at four levels.
Uncertain mediated schema: The mediated schema is the set of schema terms in
which queries are posed. These terms do not necessarily cover all the attributes appearing
in any of the sources, but rather the aspects of the domain that the application builder
wishes to expose to the users. Uncertainty in the mediated schema can arise for several
reasons. First, as we describe in Sect. 4, if the mediated schema is automatically
inferred from the data sources in a pay-as-you-go integration system, there will be
some uncertainty about the results. Second, when domains get broad, there will be
some uncertainty about how to model the domain. For example, if we model all
the topics in Computer Science, there will be some uncertainty about the degree of
overlap between different topics.
Uncertain schema mappings: Data integration systems rely on schema mappings
for specifying the semantic relationships between the data in the sources and the
terms used in the mediated schema. However, schema mappings can be inaccurate.
In many applications, it is impossible to create and maintain precise mappings
between data sources. This can be because the users are not skilled enough to provide
precise mappings, such as in personal information management (Dong and
Halevy 2005); because people do not understand the domain well and thus do not even
know what the correct mappings are, such as in bioinformatics; or because the scale of the
data prevents generating and maintaining precise mappings, such as in integrating
Web-scale data (Madhavan et al. 2007). Hence, in practice, schema mappings
are often generated by semiautomatic tools and not necessarily verified by domain
experts.
Uncertain data: By nature, data integration systems need to handle uncertain data.
One reason for uncertainty is that data are often extracted by automatic methods from
unstructured or semistructured sources (e.g., HTML pages, emails, blogs).
A second reason is that data may come from sources that are unreliable or not up to
date. For example, in enterprise settings, it is common for informational data such as
gender, race, and income level to be dirty or missing, even when the transactional
data are precise.
Uncertain queries: In some data integration applications, especially on the Web,
queries will be posed as keywords rather than as structured queries against a well-defined
schema. The system needs to translate these queries into some structured
form so that they can be reformulated with respect to the data sources. At this
step, the system may generate multiple candidate structured queries and have some
uncertainty about which one reflects the real intent of the user.
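
The keyword-to-structured-query step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each keyword is matched against mediated-schema attributes and that some scorer has already produced a non-negative match score per attribute; the names CandidateQuery and candidate_queries, the attribute names, and the scores are all hypothetical and do not come from the text. Normalizing scores into probabilities is one simple way to represent the uncertainty, not necessarily the approach taken later in the chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateQuery:
    """One possible structured reading of a keyword: attribute = keyword."""
    attribute: str      # mediated-schema attribute (hypothetical names below)
    keyword: str
    probability: float  # normalized likelihood that this is the user's intent


def candidate_queries(keyword, attribute_scores):
    """Turn one keyword into ranked candidate structured queries.

    attribute_scores maps each mediated-schema attribute to a non-negative
    match score (assumed to be given). Scores are normalized so that the
    probabilities of the surviving candidates sum to 1.
    """
    total = sum(attribute_scores.values())
    if total <= 0:
        return []  # no attribute matched; no structured reading available
    candidates = [
        CandidateQuery(attr, keyword, score / total)
        for attr, score in attribute_scores.items()
        if score > 0
    ]
    # Most likely interpretation first.
    return sorted(candidates, key=lambda c: c.probability, reverse=True)


# Hypothetical scores for the keyword "halevy" in a bibliography domain.
scores = {"author": 3.0, "title": 1.0, "venue": 0.0}
result = candidate_queries("halevy", scores)
for c in result:
    print(f"{c.attribute} = '{c.keyword}'  (p = {c.probability:.2f})")
```

Keeping every candidate with its probability, rather than committing to the top one, lets later reformulation steps weigh answers by how likely each interpretation is.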