Data Extraction, Transformation and Integration Guided by an Ontology - Data Warehousing Design and Advanced Engineering Applications - page 31

Table 1. Example of iterative similarity computation

Iterations                        0     1      2      3      4
x1 = max(0.68, x2, x3, ¼ * x4)    0     0.68   0.9    0.9    0.9
x2 = max(0.1, ½ * x1)             0     0.1    0.34   0.45   0.45
x3 = max(0.9, ½ * x1)             0     0.9    0.9    0.9    0.9
x4 = max(0.42, x1)                0     0.42   0.68   0.9    0.9

tions of the reference pairs. The weights used in the value computation of the variables x1, x2, x3 and x4 are respectively: λ11 = ¼, λ21 = ½, λ31 = ½ and λ41 = ½.

We assume that the fixpoint precision ε is equal to 0.005.

The equation system is the one given in Example 2. The different iterations of the resulting similarity computation are provided in Table 1. The solution of the equation system is X = (0.9, 0.45, 0.9, 0.9). This corresponds to the similarity scores of the four reference pairs. The fixpoint has been reached after four iterations. The error vector is then equal to 0. If we fix the reconciliation threshold T_rec at 0.80, then we obtain three reconciliation decisions, concerning two cities, two museums and two paintings.
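The iterations of Table 1 can be reproduced with a short script. The following is only a minimal sketch: it assumes that all variables are updated synchronously from the values of the previous iteration, and that the error is the largest absolute change between two successive iterations. The names `iterate_similarity`, `history` and `reconciled` are illustrative and do not come from the chapter.

```python
def iterate_similarity(epsilon=0.005, max_iter=100):
    """Iterate the four equations of Table 1 until the scores stop changing.

    Assumption: synchronous updates, error measured as the largest
    absolute change between two successive iterations.
    """
    x = [0.0, 0.0, 0.0, 0.0]      # iteration 0: every score starts at 0
    history = [tuple(x)]
    for _ in range(max_iter):
        x1, x2, x3, x4 = x
        new_x = [
            max(0.68, x2, x3, 0.25 * x4),   # x1
            max(0.1, 0.5 * x1),             # x2
            max(0.9, 0.5 * x1),             # x3
            max(0.42, x1),                  # x4
        ]
        error = max(abs(new - old) for new, old in zip(new_x, x))
        x = new_x
        history.append(tuple(x))
        if error < epsilon:       # fixpoint reached within precision epsilon
            break
    return x, history


scores, history = iterate_similarity()

# Keep the reference pairs whose score reaches the threshold T_rec = 0.80.
T_REC = 0.80
reconciled = [f"x{i + 1}" for i, s in enumerate(scores) if s >= T_REC]

for step, values in enumerate(history):
    print(step, values)
print("reconciled:", reconciled)
```

Under these assumptions the loop stops at the fourth iteration, when the scores no longer change, and the pairs associated with x1, x3 and x4 pass the threshold.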

The recall and the precision can easily be obtained by computing the ratio of the reconciliations or non-reconciliations obtained by L2R and N2R among those that are provided in the benchmark.
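As an illustration of these two ratios, here is a minimal sketch that compares a set of decisions with the decisions listed in a benchmark. The reference identifiers and the helper names (`pair`, `recall`, `precision`) are hypothetical; the same functions would apply to non-reconciliation decisions.

```python
def pair(a, b):
    """Unordered pair of reference identifiers."""
    return frozenset((a, b))


def recall(found, expected):
    """Share of the benchmark decisions that were actually found."""
    return len(found & expected) / len(expected) if expected else 0.0


def precision(found, expected):
    """Share of the found decisions that appear in the benchmark."""
    return len(found & expected) / len(found) if found else 0.0


# Toy data, invented for the example only.
expected_rec = {pair("a1", "a2"), pair("a3", "a4")}
found_rec = {pair("a1", "a2")}

print("recall:", recall(found_rec, expected_rec))
print("precision:", precision(found_rec, expected_rec))
```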

L2R Results

Since the set of reconciliations and the set of non-reconciliations are obtained by a logical resolution-based algorithm, the precision is 100% by construction. The meaningful measure to evaluate in our experiments is therefore the recall. We focus on the results obtained for the Article and Conference classes, which contain 1295 and 1292 references respectively.

As shown in the “RDFS+” column of Table 2, the recall is 50.7%. This breaks down into a recall of 52.7% computed on the REC subset and a recall of 50.6% computed on the NREC subset.

For this data set, the RDFS+ schema can easily be enriched by declaring that the property confYear is discriminant. When this property is exploited, the recall on the NREC subset grows to 94.9%, as shown in the “RDFS+ & DP” column. This significant improvement is due to the chaining of different reconciliation rules: the non-reconciliations of references to conferences whose confYear values differ entail in turn non-reconciliations of the associated articles, by exploiting the constraint PF(published).
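The chaining described above can be sketched as follows. This is a simplified illustration, not the resolution-based algorithm of L2R: it assumes each article has exactly one `published` value, encodes the discriminant property as "different confYear values imply non-reconciliation", and uses the contrapositive of PF(published) (two articles published in non-reconciled conferences cannot be reconciled). All identifiers and data are hypothetical.

```python
from itertools import combinations


def non_reconciliations(conf_year, published):
    """Derive non-reconciliations of conferences, then of articles.

    conf_year: conference id -> value of confYear
    published: article id -> conference id (functional property)
    """
    # Step 1: confYear is discriminant, so different years separate conferences.
    conf_nrec = {
        frozenset((c1, c2))
        for c1, c2 in combinations(sorted(conf_year), 2)
        if conf_year[c1] != conf_year[c2]
    }
    # Step 2: PF(published) read contrapositively, so articles published in
    # non-reconciled conferences are themselves non-reconciled.
    article_nrec = {
        frozenset((a1, a2))
        for a1, a2 in combinations(sorted(published), 2)
        if frozenset((published[a1], published[a2])) in conf_nrec
    }
    return conf_nrec, article_nrec


# Toy data, invented for the example only.
conf_year = {"c1": 1995, "c2": 1996, "c3": 1995}
published = {"a1": "c1", "a2": "c2", "a3": "c3"}

conf_nrec, article_nrec = non_reconciliations(conf_year, published)
print(sorted(sorted(p) for p in article_nrec))
```

In this toy data the pair (a1, a3) stays undecided, because the two conferences share the same year and nothing separates them.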

This recall is comparable to (though slightly lower than) the recall obtained on the same data set by supervised methods such as (Dong et

Experiments

L2R and N2R have been implemented and tested on the Cora benchmark (used by (Dong et al., 2005; Parag & Domingos, 2005)). It is a collection of 1295 citations of 112 different research papers in computer science. For this data set, the UNA (Unique Name Assumption) is not stated and the RDF facts describe references that belong to three different classes (Article, Conference, Person). We have designed a simple RDFS schema for the scientific publication domain, which we have enriched with disjunction constraints (e.g. DISJOINT(Article, Conference)), a set of functional property constraints (e.g. PF(published), PF(confName)) and a set of inverse functional property constraints (e.g. PFI(title, year, type), PFI(confName, confYear)).
