wrappers given a set of mappings. It starts from the description of the abstract content of an external source and performs data acquisition, i.e., data extraction and transformation, so that the data conform to a single global schema. The description of the abstract content of an external source can also be used to manage sources whose data remain locally stored, which makes our techniques well integrated with the PICSEL mediator-based approach.
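To illustrate the kind of processing involved, the sketch below shows how mappings from source-specific XML paths to properties of a global schema might drive extraction and transformation. The element names, the mapping format, and the extract function are invented for illustration; this is not the actual PICSEL wrapper machinery.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from paths in a source-specific XML format
# to properties of the global schema (all names invented).
MAPPINGS = {
    ".//hotelName": "Hotel.name",
    ".//town": "Hotel.locatedIn",
    ".//rate": "Hotel.price",
}

def extract(xml_text, mappings):
    """Extract data from one source document and rename it so that
    it conforms to the global schema."""
    root = ET.fromstring(xml_text)
    record = {}
    for source_path, global_property in mappings.items():
        node = root.find(source_path)
        if node is not None and node.text:
            record[global_property] = node.text.strip()
    return record

doc = ("<hotel><hotelName>Hotel du Parc</hotelName>"
       "<town>Paris</town><rate>120</rate></hotel>")
print(extract(doc, MAPPINGS))
# {'Hotel.name': 'Hotel du Parc', 'Hotel.locatedIn': 'Paris', 'Hotel.price': '120'}
```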
The transformation phase is then followed by a reconciliation step that handles several problems: possible mismatches between data referring to the same real-world object (different conventions and vocabularies can be used to represent and describe data); possible errors in the stored data, which are especially frequent when data are automatically extracted from the Web; and possible inconsistencies between the values that represent the properties of the same real-world object in different sources.
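A small invented example makes these problems concrete: the two records below describe the same hotel, yet they differ in conventions, contain an extraction error, and disagree on a property value.

```python
# Two references to the same real-world hotel, coming from two
# hypothetical sources (all values invented for illustration).
ref_from_source1 = {
    "name": "Hôtel du Parc",      # accented spelling
    "city": "Paris",
    "stars": 3,
}
ref_from_source2 = {
    "name": "Hotel du Parcc",     # extraction typo, different convention
    "city": "Paris (75)",         # different vocabulary for the city
    "stars": 2,                   # inconsistent property value
}
# Both records conform to the same global schema, yet a reconciliation
# step is still needed to decide that they denote the same object.
```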
This reconciliation step is essential because conformity to a single global schema does not by itself prevent variations between data descriptions. For this last step, we propose a knowledge-based and unsupervised approach that combines two methods: a logical one called L2R and a numerical one called N2R. The Logical method for Reference Reconciliation (L2R) is based on the translation of part of the schema semantics into first-order-logic Horn rules.
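As a much-simplified illustration of this idea, the sketch below applies two rules of the kind such a translation can produce: one derived from the disjointness of two classes, and one derived from a property declared inverse functional. The rules, schema, and data are ours, invented for illustration; they are not the actual L2R rule set.

```python
# Two rules in the spirit of a schema-semantics translation:
#   R1: DISJOINT(C1, C2) & C1(x) & C2(y)          => not_reconcile(x, y)
#   R2: INVERSE_FUNCTIONAL(p) & p(x, v) & p(y, v) => reconcile(x, y)

DISJOINT = {("Hotel", "Museum")}
INVERSE_FUNCTIONAL = {"phoneNumber"}

facts = {
    "r1": {"class": "Hotel",  "phoneNumber": "+33 1 40 00 00 00"},
    "r2": {"class": "Museum", "phoneNumber": "+33 1 44 00 00 00"},
    "r3": {"class": "Hotel",  "phoneNumber": "+33 1 40 00 00 00"},
}

def apply_rules(facts):
    reconcile, not_reconcile = set(), set()
    refs = sorted(facts)
    for i, x in enumerate(refs):
        for y in refs[i + 1:]:
            cx, cy = facts[x]["class"], facts[y]["class"]
            if (cx, cy) in DISJOINT or (cy, cx) in DISJOINT:
                not_reconcile.add((x, y))                  # rule R1
            for p in INVERSE_FUNCTIONAL:
                if facts[x].get(p) and facts[x].get(p) == facts[y].get(p):
                    reconcile.add((x, y))                  # rule R2
    return reconcile, not_reconcile

rec, not_rec = apply_rules(facts)
print(rec)      # {('r1', 'r3')}
print(not_rec)  # {('r1', 'r2'), ('r2', 'r3')} (in some order)
```

Because such rules only fire when the schema provides the relevant knowledge, the decisions they produce are sound but partial, which motivates the numerical method described next.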
To complement the partial results of L2R, we have designed a Numerical method for Reference Reconciliation (N2R). It exploits the L2R results and computes a similarity score for each pair of references.
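The sketch below conveys the flavor of such a numerical computation: a pair already decided by the logical step keeps a score of 1 or 0, and the remaining pairs get a weighted combination of attribute-value similarities. The weights, the difflib measure, and the record layout are our simplifications; a real implementation would be richer, for instance by letting the scores of related reference pairs influence one another.

```python
from difflib import SequenceMatcher

def attr_sim(a, b):
    """String similarity in [0, 1] (difflib's ratio, standing in for
    whatever attribute-level measure a real implementation uses)."""
    return SequenceMatcher(None, a, b).ratio()

def similarity(x, y, weights, logical_decisions):
    """Similarity score of a reference pair, reusing logical results:
    pairs already (non-)reconciled logically score 1.0 (or 0.0)."""
    pair = (x["id"], y["id"])
    if pair in logical_decisions:          # exploit the L2R-style results
        return 1.0 if logical_decisions[pair] else 0.0
    score = total = 0.0
    for attr, w in weights.items():
        if attr in x and attr in y:
            score += w * attr_sim(str(x[attr]), str(y[attr]))
            total += w
    return score / total if total else 0.0

x = {"id": "r1", "name": "Hôtel du Parc", "city": "Paris"}
y = {"id": "r2", "name": "Hotel du Parcc", "city": "Paris (75)"}
print(round(similarity(x, y, {"name": 0.7, "city": 0.3}, {}), 2))  # ~0.82
```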
This chapter is organized as follows. In section 2, we present closely related work and point out the novel features of the approach presented in this chapter. In section 3, we describe our approach. First, we define the data model used to represent the ontology and the data, the XML sources, and the automatically generated mappings used as inputs to the data extraction and transformation process. We then present the data extraction and transformation tasks and the two reconciliation techniques (L2R and N2R), followed by a summary of the results that we have obtained. In section 4, we briefly describe future research directions. Finally, section 5 concludes the chapter.
BACKGROUND
Many modern applications, such as data warehousing, global information systems, and electronic commerce, need to take existing data with a particular schema and reuse them in a different form. For a long time, data conversion was done in an ad hoc manner by developing non-reusable software. Later, language-based and declarative approaches provided tools for the specification and implementation of data and schema translations among heterogeneous data sources (Abiteboul et al., 1997; Cluet et al., 1998). Such rule-based approaches can deal with transformations made complex either by the diversity of data models or by the need for schema matching. In the former case, the approach helps to customize general-purpose translation tools. In the latter case, the idea is that the system automatically finds the matching between two schemas, based on a set of rules that specify how to perform the matching. All these works provide tools for designing data conversion programs, but they do not provide the ability to query external sources. More recently, the Clio system (Popa et al., 2002) has been proposed as a complement to and an extension of the language-based approaches. Given value correspondences that describe how to populate a single attribute of a target schema, this system discovers the mapping query needed to transform source data into target data. It produces SQL queries and provides users with data samples that help them understand the produced mappings.
Our work can also be compared to data integration systems, which provide mechanisms for uniformly querying sources through a target schema without materializing it in advance. These works adopt either the Global-As-View (GAV) or the Local-As-View (LAV) approach.