and describes the target schema in terms of the
local schemas, or the Local-As-View (LAV)
approach, which describes every source schema
in terms of the target one. Building on these two
approaches, a hybrid approach called
Global-Local-As-View (GLAV), implemented
in SWIM (Koffina et al., 2006), makes it possible
to specify mappings between elements of the
target schema and elements of the source schemas,
considered one by one. We also adopted it in our
work. It simplifies the definition of the mappings
and allows a higher degree of automation of the
extraction and transformation tasks.
Compared with the approaches cited above,
the present work has several interesting
features drawn from both data conversion and
data integration (mediator) work. Given a set of
mappings, our approach is entirely automatic.
Our solution has to be integrated into the PICSEL
mediator-based approach. In PICSEL, queries are
rewritten in terms of views that describe the
content of the sources. Hence, a solution to data
extraction and transformation that also generates
these views automatically is particularly
valuable. The specification of how
to perform the matching between the sources and
the data warehouse can then be automatically
generated by producing XML queries from the
mappings, the views and the ontology. It corresponds
to the extraction and transformation steps
performed on the source taken as a whole, and not
attribute by attribute as in work that aims at
converting one relational database into another.
The approach is directed by the ontology: only
data that can be defined in terms of the ontology
are extracted. Furthermore, XML queries
can transform data so that they are
defined in terms of the ontology and expressed
in the same format. This is how the
transformation task is handled.
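
The idea of mapping-driven extraction can be illustrated with a minimal sketch. It is not the PICSEL implementation and it does not generate XML queries; it only applies a hand-written mapping table with Python's standard `xml.etree.ElementTree`. The source document, the element names (`hotel`, `name`, `town`, `fax`) and the ontology terms (`hasName`, `locatedIn`) are all hypothetical, chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document (element names are invented for this sketch).
SOURCE_XML = """
<catalog>
  <hotel><name>Hotel Lux</name><town>Paris</town><fax>0102</fax></hotel>
  <hotel><name>Sea View</name><town>Nice</town></hotel>
</catalog>
"""

# Hypothetical mappings: path of a source element (relative to one instance)
# -> term of the ontology. 'fax' has no mapping, so it is never extracted.
INSTANCE_PATH = "./hotel"
MAPPINGS = {"name": "hasName", "town": "locatedIn"}


def extract(xml_text, instance_path, mappings):
    """Return one record per source instance, expressed with ontology terms."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.findall(instance_path):
        record = {}
        for source_path, ontology_term in mappings.items():
            child = node.find(source_path)
            if child is not None and child.text:
                # Transformation step: put every value in one common format
                # (here simply trimmed and lower-cased).
                record[ontology_term] = child.text.strip().lower()
        records.append(record)
    return records


records = extract(SOURCE_XML, INSTANCE_PATH, MAPPINGS)
```

In this sketch, only values reachable through a mapping end up in the records, which mirrors the point that data not definable in terms of the ontology are left out.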
The problem of reference reconciliation was
introduced by the geneticist Newcombe (1959)
and was first formalized by Fellegi and Sunter
(1969). Since then, various approaches
have been proposed. We distinguish
these approaches according to how they exploit
the reference description, how knowledge
is acquired, and which kind of result the methods
produce.
For the reference description, we distinguish three
cases. The first is the exploitation of the
unstructured description formed by the text appearing in the
attributes (Cohen, 2000; Bilke & Naumann, 2005).
In these approaches, the similarity is computed
using only the textual values, treated as a
single long string, without distinguishing which
value corresponds to which attribute. This kind
of approach is useful for fast
similarity computation (Cohen, 2000), for obtaining
a set of reference pairs that are candidates for
reconciliation (Bilke & Naumann, 2005), or when
the attribute-value associations may be incorrect.
The second type of approach considers
the reference description as structured
into several attributes. A large number of methods
have adopted this view, either by proposing
probabilistic models (Fellegi & Sunter, 1969), which
allow reconciliation decisions to be taken once the
parameters of the probabilistic model have been estimated,
or by computing a similarity score for the
reference pairs (Dey et al., 1998a) using similarity
measures (Cohen et al., 2003). The third type
considers, in addition to the reference
description structured as a set of attributes, the
relations that link the references together (Dong
et al., 2005). These global approaches take into
account a larger set of information, which makes it
possible to improve the results in terms of the number of
false positives (Bhattacharya & Getoor, 2006) or
in terms of the number of false negatives. Like
those approaches, both the logical L2R method and the
numerical N2R method are global, since they
exploit the structured description composed of
attributes and relations. The relations are used both
in the propagation of reconciliation decisions by
the logical rules (L2R) and in the propagation of
similarity scores through the iterative computation
of the similarity (N2R).
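
The three kinds of reference description discussed above can be contrasted in a short sketch. It is only an illustration under simplifying assumptions: token-set Jaccard stands in for the similarity measures of the cited work, the weighted average is not the Fellegi and Sunter model, and `propagate` is a generic fixed-point iteration, not the N2R equations. All function names, attribute names, weights and the `alpha` parameter are hypothetical.

```python
def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def flat_similarity(r1, r2):
    # Case 1: all values merged into one long string, attributes ignored.
    return jaccard(" ".join(r1.values()), " ".join(r2.values()))


def attribute_similarity(r1, r2, weights):
    # Case 2: values compared attribute by attribute, then averaged.
    total = sum(weights.values())
    return sum(w * jaccard(r1.get(a, ""), r2.get(a, ""))
               for a, w in weights.items()) / total


def propagate(base, related, alpha=0.5, rounds=10):
    # Case 3: the score of a pair is repeatedly mixed with the mean score
    # of the pairs it is linked to through relations.
    scores = dict(base)
    for _ in range(rounds):
        updated = {}
        for pair, own in base.items():
            neighbours = related.get(pair, [])
            if neighbours:
                mean = sum(scores[q] for q in neighbours) / len(neighbours)
                updated[pair] = (1 - alpha) * own + alpha * mean
            else:
                updated[pair] = own
        scores = updated
    return scores


# Two hypothetical references whose values are swapped between attributes.
r1 = {"name": "paris hotel", "city": "lyon"}
r2 = {"name": "lyon hotel", "city": "paris"}
flat = flat_similarity(r1, r2)
structured = attribute_similarity(r1, r2, {"name": 1.0, "city": 1.0})

# A hotel pair linked by a relation to a city pair that is clearly similar.
base = {("h1", "h2"): 0.5, ("c1", "c2"): 1.0}
related = {("h1", "h2"): [("c1", "c2")]}
final = propagate(base, related)
```

On the swapped example the flat score is maximal while the attribute-wise score stays low, which shows what is lost when attribute boundaries are ignored. In the relational case, the hotel pair's score rises because the related city pair is similar, which is the intuition behind propagating similarity through relations.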