$M_{i,\sigma(i)}$ represents the similarity value of attribute $i$ in $S$ and its matching counterpart, attribute $\sigma(i)$, in $S'$.
$f(\sigma, M)$ is a function that aggregates the similarity measures associated with the individual attribute correspondences that form a schema matching $\sigma$. A popular choice of a local aggregator is the sum (or average) of attribute correspondence similarity measures (e.g., [Do and Rahm, 2002, Gal et al., 2005b, Melnik et al., 2002]), but other local aggregators have been found appealing as well. For example, the Dice local aggregator, suggested by Do and Rahm [2002], is the ratio of the number of successfully matched attributes (those whose similarity measure has passed a given threshold) to the total number of attributes in both schemata. Threshold-based aggregators have been presented as well, e.g., by Modica et al. [2001]. $f$ is typically assumed to be computable in time linear in the matrix size. However, at least technically, there is no restriction on the use of more sophisticated (and possibly more computation-intensive) local aggregators.
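The local aggregators described above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from any of the cited systems. It assumes a matching is stored as a dictionary from attribute indices of $S$ to attribute indices of $S'$, that the matching is one-to-one (so each correspondence that passes the threshold contributes two matched attributes to the Dice count), and that "passing" the threshold means greater than or equal. The function names and the default threshold are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]   # M[i][j]: similarity of attribute i in S and j in S'
Matching = Dict[int, int]    # sigma: attribute index i in S -> sigma(i) in S'


def sum_aggregator(sigma: Matching, M: Matrix) -> float:
    """Sum of the similarity values M[i][sigma(i)] over the matching."""
    return sum(M[i][j] for i, j in sigma.items())


def average_aggregator(sigma: Matching, M: Matrix) -> float:
    """Mean similarity over the matching (0.0 for an empty matching)."""
    return sum_aggregator(sigma, M) / len(sigma) if sigma else 0.0


def dice_aggregator(sigma: Matching, M: Matrix, threshold: float = 0.5) -> float:
    """Matched attributes divided by all attributes in both schemata.

    Assumption: sigma is one-to-one, so every correspondence whose
    similarity reaches the threshold matches one attribute in each schema.
    """
    n, n_prime = len(M), len(M[0])
    passing = sum(1 for i, j in sigma.items() if M[i][j] >= threshold)
    return 2 * passing / (n + n_prime)


# Hypothetical 3 x 2 similarity matrix and a matching of two attribute pairs.
M_example = [[0.9, 0.1],
             [0.2, 0.6],
             [0.3, 0.4]]
sigma_example = {0: 0, 1: 1}

sum_score = sum_aggregator(sigma_example, M_example)
avg_score = average_aggregator(sigma_example, M_example)
dice_score = dice_aggregator(sigma_example, M_example, threshold=0.5)
```

Each function visits every correspondence in $\sigma$ once, which is consistent with the linear-time assumption on $f$.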
Given two schemata $S$ and $S'$, an ensemble of $m$ schema matchers may utilize different local aggregators $f^{(1)}, \ldots, f^{(m)}$. Each local aggregator computes the similarity measure of a matching for a different matcher and may be tied to the specific capabilities of that matcher. For example, it may be more meaningful to apply an average aggregator than a min aggregator to a matcher that does not use a threshold. The $m$ matchers produce an $m \times n \times n'$ similarity cube of $n \times n'$ similarity matrices $M^{(1)}, \ldots, M^{(m)}$. The similarity measures produced by such an ensemble of schema matchers can be aggregated using a real-valued global aggregation function $F(f^{(1)}(\sigma, M^{(1)}), \cdots, f^{(m)}(\sigma, M^{(m)}))$ [Do and Rahm, 2002, Gal et al., 2005b].
$\langle f, F \rangle$ denotes the set of local and global aggregators, respectively. The aggregated weight provided by the $m$ matchers with $\langle f, F \rangle$ to the matching $\sigma$ is given as

$$\langle f, F \rangle(\sigma) \equiv F\left(f^{(1)}(\sigma, M^{(1)}), \cdots, f^{(m)}(\sigma, M^{(m)})\right).$$
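To make the composition of local and global aggregators concrete, the following Python sketch computes the aggregated weight of a matching over a small similarity cube. It is a minimal illustration under stated assumptions: the cube is a list of matrices, a matching is a dictionary from attribute indices of the first schema to attribute indices of the second, and the global aggregator is any function from a sequence of local scores to a real number. The names `ensemble_weight`, `average_local`, and `min_local` are hypothetical, and the use of `max` and `sum` as global aggregators is only an example.

```python
from typing import Callable, Dict, List, Sequence

Matrix = List[List[float]]
Matching = Dict[int, int]
LocalAggregator = Callable[[Matching, Matrix], float]
GlobalAggregator = Callable[[Sequence[float]], float]


def average_local(sigma: Matching, M: Matrix) -> float:
    """Average similarity over the correspondences in sigma."""
    return sum(M[i][j] for i, j in sigma.items()) / len(sigma)


def min_local(sigma: Matching, M: Matrix) -> float:
    """Smallest similarity among the correspondences in sigma."""
    return min(M[i][j] for i, j in sigma.items())


def ensemble_weight(sigma: Matching,
                    cube: Sequence[Matrix],
                    local_aggs: Sequence[LocalAggregator],
                    global_agg: GlobalAggregator) -> float:
    """Apply the l-th local aggregator to the l-th matrix, then combine with F."""
    if len(cube) != len(local_aggs):
        raise ValueError("one local aggregator is needed per matcher")
    local_scores = [f(sigma, M) for f, M in zip(local_aggs, cube)]
    return global_agg(local_scores)


# Hypothetical cube for m = 2 matchers over 2 x 2 similarity matrices.
cube_example = [
    [[0.8, 0.2], [0.1, 0.6]],   # matcher 1
    [[0.5, 0.5], [0.3, 0.9]],   # matcher 2
]
sigma_example = {0: 0, 1: 1}
aggs_example = [average_local, min_local]

weight_max = ensemble_weight(sigma_example, cube_example, aggs_example, max)
weight_sum = ensemble_weight(sigma_example, cube_example, aggs_example, sum)
```

Keeping the local aggregators in a list that parallels the cube reflects the point that each matcher may use a different $f^{(l)}$.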
Many global aggregators proposed in the literature can be generalized as

$$F\left(f^{(1)}(\sigma, M^{(1)}), \cdots, f^{(m)}(\sigma, M^{(m)})\right) = \frac{\lambda}{m} \sum_{l=1}^{m} k_l f^{(l)}(\sigma, M^{(l)}), \tag{4.1}$$

where Eq. 4.1 can be interpreted as a (weighted) sum (with $\lambda = m$) or a (weighted) average (with $\lambda = 1$) of the local similarity measures, and where $k_l$ are some arbitrary weighting parameters. It is important to note that the choice of a global aggregator is ensemble-dependent, and it is considered to be a given property of the ensemble.
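The role of $\lambda$ in Eq. 4.1 is easy to check numerically. The sketch below takes already-computed local scores $f^{(l)}(\sigma, M^{(l)})$ and applies the linear form of the equation. The function name and the sample scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def linear_global_aggregator(local_scores: Sequence[float],
                             weights: Sequence[float],
                             lam: float) -> float:
    """Compute (lam / m) * sum_l k_l * f_l, the linear form of Eq. 4.1."""
    m = len(local_scores)
    if m == 0:
        raise ValueError("at least one matcher is required")
    if len(weights) != m:
        raise ValueError("one weight k_l is needed per matcher")
    return (lam / m) * sum(k * s for k, s in zip(weights, local_scores))


# Hypothetical local scores from m = 3 matchers, with unit weights.
scores = [0.7, 0.5, 0.9]
unit_weights = [1.0, 1.0, 1.0]

as_sum = linear_global_aggregator(scores, unit_weights, lam=len(scores))  # lambda = m
as_average = linear_global_aggregator(scores, unit_weights, lam=1.0)      # lambda = 1
```

With $\lambda = m$ the factor $\lambda / m$ equals 1 and the result is the weighted sum of the local scores. With $\lambda = 1$ the factor is $1/m$ and the result is their weighted average.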
This model represents just one possible ensemble design, a linear parallel multiple-matcher design model. We now extend this model in three different dimensions, to demonstrate the ensemble design space. The first two dimensions are illustrated in Table 4.1, with representative examples for each design decision in the space.
Participation dimension: The choice of schema matchers that participate in an ensemble is an important tuning parameter of the matching process. In Section 4.3, we provide a method for matcher selection. Works in the literature typically construct matcher ensembles from multiple