Data Extraction, Transformation and Integration Guided by an Ontology - Data Warehousing Design and Advanced Engineering Applications

Database Reference

In-Depth Information

0.68, which is the similarity score between the

two strings “ Le Louvre ” and “ musée du Louvre ”

computed by the Jaro-Winkler function (Cohen

et al., 2003). It also expresses that it weakly de-

pends on x 4 .

The reason of the strong dependencies is

that contains is an inverse functional relation

(a painting is contained in only one museum)

relating S1_r607 and S2_r208 (the similarity

of which is modeled by x 1 ) to S1_p112 for

S1_r607 and S2_p222 for S2_r208, and name

is a functional attribute (a museum has only

one name) relating S1_r607 and S2_r208 re-

spectively to the two strings “ Le Louvre'' and

“ musee du Louvre ”.

The weak dependency of x 4 onto x 1 is ex-

pressed by the term x 4 /4 in the equation, where

the ratio ¼ comes from that there are 4 prop-

erties (relations or attributes) involved in the

data descriptions of S1_r607 and S2_r208. The

dependency of x 4 onto x 1 is weaker than the

previous ones because located is not an inverse

functional relation.

we have neither expert knowledge nor training

data, the weights are computed in function of the

number of the common attributes and relations.

See (Saïs et al., 2009) for the detailed definition

of f i-df (X) and f i-ndf (X).

Iterative Algorithm for Reference Pairs Simi-

larity Computation

To compute the similarity scores, we have imple-

mented an iterative resolution method. At each

iteration, the method computes the variables

values by using those computed in the precedent

iteration.

Starting from an initial vector X 0 =(x 1 0 , x 2 0 ,...,

x n 0 ) , the value of the vector X at the k -th iteration

is obtained by the expression: X k = F(X k-1 ). At each

iteration k we compute the value of each x i k : x i k =

fi(x 1 k-1 , x 2 k-1 ,... x n k-1 ) until a fixpoint with precision

ε is reached. The fixpoint is reached when: ∀ i,

|x i k - x i k-1 | ≤ ε . The more ε value is small the more

the set of reconciliations may be large.

The complexity of this method is in ( n 2 ) for

each iteration, where n is the number of variables.

We have proved its convergence for the resolution

of our equation system.

The similarity computation is illustrated by

the following equation system obtained from

the data descriptions shown in Example 1. The

constants correspond to the similarity scores of

pairs of basic values computed by using the Jaro-

Winkler measure. The constants involved in the

value computation of the variables x 1 , x 2 , x 3 and

x 4 are respectively:

The Equations Modeling the Dependencies

between Similarities

For each pair of references, its similarity score is

modeled by a variable x i and the way it depends

on other similarity scores is modeled by an equa-

tion: x i = f i (X), where i ∈ [1..n] and n is the number

of reference pairs for which we apply N2R, and

X=(x 1 , x 2 , …, x n ). Each equation x i = f i (X) is of the

form: f i fi(X) = max(f i-df fi(X) f i-ndf (X)).

The function f i-df (X) is the maximum of the

similarity scores of the value pairs and the refer-

ence pairs of attributes and relations with which the

i i-th reference pair is functionally dependent. The

maximum function allows propagating the similar-

ity scores of the values and the references having

a strong impact. The function f i-ndf (X) is defined by

a weighted average of the similarity scores of the

values pairs (and sets) and the reference pairs (and

sets) of attributes and relations with which the i -th

reference pair is not functionally dependent. Since

•

b11= Sim v (“Louvre”, “musée du Louvre”)

= 0.68

•

b21= Sim v (“La Joconde”, “Iris”) = 0.1

•

b31= Sim v (“La Joconde”, “Joconde”) =

0.9

•

b41= Sim v (“Paris”, “Ville de Paris”) =

0.42

The weights are computed in function of the

number of common attributes and common rela-

Data Warehousing Design and Advanced Engineering Applications

Search WWH ::

Custom Search

Home