Data Extraction, Transformation and Integration Guided by an Ontology - Data Warehousing Design and Advanced Engineering Applications


Figure 5. A query (on the left side) and the extracted data (on the right side) from S_1

to the data of S_1 and S_2. The problem consists in deciding whether references are reconciled or not reconciled. Let Reconcile be a binary predicate: Reconcile(X, Y) means that the two references denoted by X and Y refer to the same world entity.

The reference reconciliation problem considered in L2R consists in extracting from the set I_1 × I_2 of reference pairs two subsets REC and NREC such that:

$$\mathrm{REC} = \{\, (i, i') \mid \mathit{Reconcile}(i, i') \,\}$$

$$\mathrm{NREC} = \{\, (i, i') \mid \neg \mathit{Reconcile}(i, i') \,\}$$

The reference reconciliation problem considered in N2R consists in computing, given a similarity function Sim_r : I_1 × I_2 → [0..1] and a threshold T_rec (a real value in [0..1] given by an expert, fixed experimentally or learned on a labeled data sample), the following set:

$$\mathrm{REC}_{N2R} = \{\, (i, i') \in (I_1 \times I_2) \setminus (\mathrm{REC} \cup \mathrm{NREC}) \mid \mathit{Sim}_r(i, i') > T_{rec} \,\}$$

L2R: A Logical Method for Reference Reconciliation

L2R (Saïs et al., 2007) is based on the inference of facts of reconciliation (Reconcile(i, j)) and of non-reconciliation (¬Reconcile(i', j')) from a set of facts and a set of rules which transpose the semantics of the data sources and of the schema into logical dependencies between reference reconciliations. Facts of synonymy (SynVals(v_1, v_2)) and of non-synonymy (¬SynVals(u_1, u_2)) between basic values (strings, dates) are also inferred. For instance, the synonymy SynVals(“JoDS”, “Journal of Data Semantics”) may be inferred. The distinguishing features of L2R are that it is global and logic-based: every constraint declared on the data and on the schema in RDFS+ is automatically translated into first-order logic Horn rules (rules for short) that express dependencies between reconciliations. The advantage of such a logical approach is that if the data are error-free and if the declared constraints are valid, then the inferred reconciliations and non-reconciliations are correct, thus guaranteeing 100% precision of the results.

We first describe the generation of the reconciliation rules. Then we present the generation of the facts and finally the reasoning, which is performed on the set of rules and facts.

Generation of the Set of Reconciliation Rules

The reconciliation rules are automatically generated from the constraints that are declared on the data sources and on their common schema.

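The N2R set defined above can be illustrated with a short Python sketch. It keeps the pairs of I_1 × I_2 that L2R left undecided (neither in REC nor in NREC) and whose similarity is strictly greater than T_rec. Everything specific here is an assumption made for illustration: the reference identifiers, the labels, the contents of REC and NREC, and above all the token-based Jaccard measure, which merely stands in for Sim_r and is not the similarity function used by N2R.

```python
from itertools import product

# Hypothetical references from two sources, each described by one label.
labels = {
    "a1": "journal of data semantics",
    "a2": "data warehousing design",
    "a3": "ontology based integration",
    "b1": "journal data semantics",
    "b2": "data warehouse design",
    "b3": "ontology based integration",
}
I1 = ["a1", "a2", "a3"]
I2 = ["b1", "b2", "b3"]

# Pairs assumed to have been decided already by the logical step (L2R).
REC = {("a3", "b3")}
NREC = {("a1", "b2")}


def jaccard(text_a, text_b):
    """Token-set Jaccard similarity in [0, 1] (placeholder for Sim_r)."""
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def sim_r(i, j):
    """Similarity between two references, based on their labels."""
    return jaccard(labels[i], labels[j])


def n2r_reconciliations(i1, i2, rec, nrec, sim, t_rec):
    """Pairs of i1 x i2 outside rec and nrec with sim(i, j) > t_rec."""
    decided = set(rec) | set(nrec)
    return {
        (i, j)
        for i, j in product(i1, i2)
        if (i, j) not in decided and sim(i, j) > t_rec
    }


print(sorted(n2r_reconciliations(I1, I2, REC, NREC, sim_r, 0.6)))
# -> [('a1', 'b1')]
```

With this toy data, lowering the threshold to 0.4 also admits the pair ("a2", "b2"), whose similarity is 0.5. The pair ("a3", "b3") is never returned, whatever the threshold, because it already belongs to REC.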