Database Reference
In-Depth Information
0.68, which is the similarity score between the
two strings “ Le Louvre ” and “ musée du Louvre
computed by the Jaro-Winkler function (Cohen
et al., 2003). It also expresses that it weakly de-
pends on x 4 .
The reason of the strong dependencies is
that contains is an inverse functional relation
(a painting is contained in only one museum)
relating S1_r607 and S2_r208 (the similarity
of which is modeled by x 1 ) to S1_p112 for
S1_r607 and S2_p222 for S2_r208, and name
is a functional attribute (a museum has only
one name) relating S1_r607 and S2_r208 re-
spectively to the two strings “ Le Louvre'' and
musee du Louvre ”.
The weak dependency of x 4 onto x 1 is ex-
pressed by the term x 4 /4 in the equation, where
the ratio ¼ comes from that there are 4 prop-
erties (relations or attributes) involved in the
data descriptions of S1_r607 and S2_r208. The
dependency of x 4 onto x 1 is weaker than the
previous ones because located is not an inverse
functional relation.
we have neither expert knowledge nor training
data, the weights are computed in function of the
number of the common attributes and relations.
See (Saïs et al., 2009) for the detailed definition
of f i-df (X) and f i-ndf (X).
Iterative Algorithm for Reference Pairs Simi-
larity Computation
To compute the similarity scores, we have imple-
mented an iterative resolution method. At each
iteration, the method computes the variables
values by using those computed in the precedent
iteration.
Starting from an initial vector X 0 =(x 1 0 , x 2 0 ,...,
x n 0 ) , the value of the vector X at the k -th iteration
is obtained by the expression: X k = F(X k-1 ). At each
iteration k we compute the value of each x i k : x i k =
fi(x 1 k-1 , x 2 k-1 ,... x n k-1 ) until a fixpoint with precision
ε is reached. The fixpoint is reached when: ∀ i,
|x i k - x i k-1 | ≤ ε . The more ε value is small the more
the set of reconciliations may be large.
The complexity of this method is in ( n 2 ) for
each iteration, where n is the number of variables.
We have proved its convergence for the resolution
of our equation system.
The similarity computation is illustrated by
the following equation system obtained from
the data descriptions shown in Example 1. The
constants correspond to the similarity scores of
pairs of basic values computed by using the Jaro-
Winkler measure. The constants involved in the
value computation of the variables x 1 , x 2 , x 3 and
x 4 are respectively:
The Equations Modeling the Dependencies
between Similarities
For each pair of references, its similarity score is
modeled by a variable x i and the way it depends
on other similarity scores is modeled by an equa-
tion: x i = f i (X), where i [1..n] and n is the number
of reference pairs for which we apply N2R, and
X=(x 1 , x 2 , …, x n ). Each equation x i = f i (X) is of the
form: f i fi(X) = max(f i-df fi(X) f i-ndf (X)).
The function f i-df (X) is the maximum of the
similarity scores of the value pairs and the refer-
ence pairs of attributes and relations with which the
i i-th reference pair is functionally dependent. The
maximum function allows propagating the similar-
ity scores of the values and the references having
a strong impact. The function f i-ndf (X) is defined by
a weighted average of the similarity scores of the
values pairs (and sets) and the reference pairs (and
sets) of attributes and relations with which the i -th
reference pair is not functionally dependent. Since
b11= Sim v (“Louvre”, “musée du Louvre”)
= 0.68
b21= Sim v (“La Joconde”, “Iris”) = 0.1
b31= Sim v (“La Joconde”, “Joconde”) =
0.9
b41= Sim v (“Paris”, “Ville de Paris”) =
0.42
The weights are computed in function of the
number of common attributes and common rela-
Search WWH ::




Custom Search