Data Extraction, Transformation and Integration Guided by an Ontology - Data Warehousing Design and Advanced Engineering Applications

and describes the target schema in terms of the local schemas, or the Local-As-View (LAV) approach, which describes every source schema in terms of the target one. A hybrid of these two approaches, called Global-Local-As-View (GLAV) and implemented in SWIM (Koffina et al., 2006), allows mappings to be specified between elements of the target schema and elements of the source schemas, considered one by one. We also adopt it in our work, because it simplifies the definition of the mappings and allows greater automation of the extraction and transformation tasks.

Compared with the approaches cited above, the present work combines features from both data conversion and data integration (mediator) work. Given a set of mappings, our approach is entirely automatic. Our solution has to be integrated into the PICSEL mediator-based approach. In PICSEL, queries are rewritten in terms of views that describe the content of the sources. A solution to data extraction and transformation that also generates these views automatically is therefore particularly valuable. The specification of how to match the sources to the data warehouse can then be generated automatically by producing XML queries from the mappings, the views and the ontology. This corresponds to extraction and transformation steps performed on the source taken as a whole, and not attribute by attribute as in work that aims at converting one relational database into another.

The approach is directed by the ontology. Only data that can be defined in terms of the ontology are extracted. Furthermore, XML queries can transform data so that they are both defined in terms of the ontology and expressed in the same format. This is how the transformation task is handled.
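The idea of ontology-directed extraction can be illustrated with a minimal sketch. It is not the XML queries generated in the approach described here: the source document, the ontology terms, the mappings and the helper `extract` are all hypothetical, and Python's standard `xml.etree.ElementTree` path expressions stand in for a real XML query language. The sketch only shows that unmapped source elements are never extracted and that mapped values can be normalised to a common format on the way out.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; element names are invented for illustration.
SOURCE_XML = """
<catalog>
  <hotel>
    <nom>Hotel du Parc</nom>
    <ville>PARIS</ville>
    <tel>01 23 45 67 89</tel>
  </hotel>
  <hotel>
    <nom>Le Relais</nom>
    <ville>lyon</ville>
    <tel>04 56 78 90 12</tel>
  </hotel>
</catalog>
"""

# Hypothetical ontology vocabulary: only these terms may appear in the output.
ONTOLOGY_TERMS = {"Hotel", "name", "locatedIn"}

# Hypothetical mappings: ontology term -> (relative path in the source, normaliser).
# <tel> has no mapping, so it is never extracted.
MAPPINGS = {
    "name": ("nom", str.strip),
    "locatedIn": ("ville", lambda v: v.strip().title()),
}


def extract(xml_text, record_path, concept):
    """Return one dict per source record, keyed by ontology terms only."""
    if concept not in ONTOLOGY_TERMS:
        raise ValueError(f"{concept!r} is not an ontology term")
    root = ET.fromstring(xml_text)
    records = []
    for node in root.findall(record_path):
        record = {"concept": concept}
        for term, (path, normalise) in MAPPINGS.items():
            if term not in ONTOLOGY_TERMS:
                continue  # a mapping to an unknown term is ignored
            value = node.findtext(path)
            if value is not None:
                # Transformation step: bring the value to a common format.
                record[term] = normalise(value)
        records.append(record)
    return records


hotels = extract(SOURCE_XML, "hotel", "Hotel")
for hotel in hotels:
    print(hotel)
```

In this sketch the telephone number is dropped because no mapping relates it to an ontology term, and the two city values, written in different cases in the source, come out in the same format.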

The problem of reference reconciliation was introduced by the geneticist Newcombe (1959) and was first formalized by Fellegi and Sunter (1969). Since then, various approaches have been proposed. We distinguish these approaches according to how they exploit the reference description, how knowledge is acquired, and which kind of result the methods produce.

For the reference description, there are three cases. The first is the exploitation of the unstructured description formed by the text appearing in the attributes (Cohen, 2000; Bilke & Naumann, 2005). In these approaches, the similarity is computed using only the textual values, treated as a single long string without distinguishing which value corresponds to which attribute. This kind of approach is useful for obtaining a fast similarity computation (Cohen, 2000), for obtaining a set of reference pairs that are candidates for reconciliation (Bilke & Naumann, 2005), or when the attribute-value associations may be incorrect.
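As an illustration of this first case, the following sketch compares two references after flattening all their attribute values into one bag of tokens. It uses plain Jaccard similarity on token sets, which is a deliberately simple stand-in and not the measure used in the cited works; the records and the helper names `flatten` and `jaccard` are hypothetical.

```python
import re


def flatten(record):
    """Concatenate all attribute values and return the set of lowercase tokens."""
    text = " ".join(str(value) for value in record.values())
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the flattened token sets of two records."""
    tokens_a, tokens_b = flatten(a), flatten(b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical references. In r2 the values sit under the wrong attributes.
r1 = {"name": "Louvre Museum", "city": "Paris"}
r2 = {"name": "Paris", "city": "Louvre Museum"}
r3 = {"name": "Orsay Museum", "city": "Paris"}

print(jaccard(r1, r2))  # 1.0: attribute names are ignored
print(jaccard(r1, r3))  # 0.5: 2 shared tokens out of 4 distinct ones
```

Because attribute names play no role, the swapped values in `r2` do not lower the score, which is the situation where incorrect attribute-value associations make this kind of comparison attractive.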

The second type of approach considers the reference description as structured in several attributes. A large number of methods have adopted this view, either by proposing probabilistic models (Fellegi & Sunter, 1969), which allow reconciliation decisions to be taken once the parameters of the probabilistic model have been estimated, or by computing a similarity score for the reference pairs (Dey et al., 1998a) using similarity measures (Cohen et al., 2003).

The third type considers, in addition to the reference description structured as a set of attributes, the relations that link the references together (Dong et al., 2005). These global approaches take a larger set of information into account, which improves the results in terms of the number of false positives (Bhattacharya & Getoor, 2006) or the number of false negatives. Like those approaches, both the logical L2R method and the numerical N2R method are global, since they exploit a structured description composed of attributes and relations. The relations are used both in the propagation of reconciliation decisions by the logical rules (L2R) and in the propagation of similarity scores through the iterative computation of the similarity (N2R).
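The propagation of similarity scores along relations can be sketched as a small fixed-point iteration. This is a toy model and not the equations of N2R: the references, the single `located_in` relation, the weight `ALPHA`, the use of `difflib.SequenceMatcher` as the attribute similarity, and the rule that unlinked pairs keep their attribute similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical references: each has a name and, optionally, a link to another reference.
REFS = {
    "m1": {"name": "Louvre Museum", "located_in": "c1"},
    "m2": {"name": "Louvre", "located_in": "c2"},
    "c1": {"name": "Paris", "located_in": None},
    "c2": {"name": "paris", "located_in": None},
}

PAIRS = [("m1", "m2"), ("c1", "c2")]
ALPHA = 0.6  # assumed weight of the attribute similarity


def attr_sim(a, b):
    """Case-insensitive string similarity of the two names, in [0, 1]."""
    name_a, name_b = REFS[a]["name"].lower(), REFS[b]["name"].lower()
    return SequenceMatcher(None, name_a, name_b).ratio()


def propagate(pairs, alpha=ALPHA, eps=1e-6, max_iter=100):
    """Recompute pair similarities until no score changes by more than eps."""
    sim = {pair: 0.0 for pair in pairs}
    for _ in range(max_iter):
        new = {}
        for a, b in pairs:
            link_a, link_b = REFS[a]["located_in"], REFS[b]["located_in"]
            if link_a is None or link_b is None:
                # No relation to exploit: fall back on the attributes alone.
                new[(a, b)] = attr_sim(a, b)
            else:
                # Blend attribute similarity with the current score of the linked pair.
                linked = sim.get((link_a, link_b), 0.0)
                new[(a, b)] = alpha * attr_sim(a, b) + (1 - alpha) * linked
        if max(abs(new[p] - sim[p]) for p in pairs) < eps:
            return new
        sim = new
    return sim


scores = propagate(PAIRS)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

In this toy setting the two museum references score higher after propagation than on their names alone, because the cities they are linked to are judged identical. A logical method in the spirit of L2R would propagate crisp reconciliation decisions along the same relation instead of numerical scores.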
