Databases Reference
In-Depth Information
leaf
u
j
and adding its children, constructed by the algorithm at iteration
i
+
1. By the induction
hypothesis, any matching different from
σ
1
,σ
2
..., σ
i
must satisfy constraints implied by
S
i
(u
l
)
and
n
i
. Therefore, any matching
σ
different from
σ
1
,σ
2
..., σ
i
must satisfy
one of the latter constraints. In addition, if
σ
satisfies constraints implied by
S
i
(u
j
)
and
S
e
(u
j
)
,
which correspond to the leaf expanded at the
i
-th iteration, then by Lemma
5.7
it must also satisfy
constraints implied by one of the children of
u
j
. We get that
σ
satisfies constraints
S
i
(x
p
)
and
S
e
(x
p
)
for some
p
, 1
≤
p
≤
n
i
+
1
, proving the induction step.
S
e
(u
l
)
for some
l
, 1
≤
l
≤
Algorithm
4
generates top-K matchings in G.
Theorem 5.9
Proof.
The correctness of the theorem follows by Lemma
5.6
and Lemma
5.8
.
Finally, we analyze the complexity of Algorithm
4
. The algorithm performs
k
iterations.
Each iteration requires invocation of the algorithm
A
best
O(
|
V
|
)
times and finding the maximum
matching on the tree frontier in
O(log(k))
. We conclude that the overall complexity of the algorithm
is
O(klog(k)
+
k
|
V
|
C(A
best
))
.
5.3
EXTENDING TOP-
K
IDENTIFICATION TO ENSEMBLES
In this section, we present several algorithms for connecting the ensemble approach of Chapter
4
and the top-
K
approach, increasing the robustness of the schema matching process by enjoying the
best of both worlds.
Domshlak et al.
[
2007
] introduced a generic computational framework,
schema
metamatching
, for computing the top-
K
prefix of a “consensus” ranking of alternative matchings be-
tween two schemata, given the graded valid schema matchings provided individually by the members
of an ensemble. Their starting point is based on rank aggregation techniques developed in the areas
of Web search and database middleware [
Dwork et al.
,
2001
,
Fagin et al.
,
2003
]. They show that
the
Threshold
algorithm, originally proposed in the context of database middleware by
Fagin et al.
[
2003
], can be applied to identify a consensus top-
K
schema matchings almost as is. Unfortunately,
computing top-
K
schema matchings using the
Threshold
algorithm may require time exponential
in the size of the matched schemata. Therefore, techniques that exploit the specifics of the schema
matching problem were developed to create a simple algorithm, the
Matrix-Direct
algorithm, whose
time complexity is polynomial in the size of the matched schemata and the required
K
. The
Matrix-
Direct
algorithm can be applied for a certain broad class of problems, and it was extended to the
Matrix-Direct-with-Bounding
algorithm, which draws upon both the
Matrix-Direct
and
Thresh-
old
algorithms to address matching scenarios where the
Matrix-Direct
algorithm is inapplicable.
The
Threshold
and
Matrix-Direct-with-Bounding
algorithms were shown to be (complexity-wise)
mutually undominated—that is, there exist problem instances in which one algorithm performs dra-
matically better than the other. This was resolved by introduction of the
CrossThreshold
algorithm,
a hybrid version of these two algorithms, based on their in-parallel,
mutually-enhancing
execution.