Databases Reference
In-Depth Information
leaf u j and adding its children, constructed by the algorithm at iteration i
+
1. By the induction
hypothesis, any matching different from σ 1 2 ..., σ i
must satisfy constraints implied by S i (u l ) and
n i . Therefore, any matching σ different from σ 1 2 ..., σ i must satisfy
one of the latter constraints. In addition, if σ satisfies constraints implied by S i (u j ) and S e (u j ) ,
which correspond to the leaf expanded at the i -th iteration, then by Lemma 5.7 it must also satisfy
constraints implied by one of the children of u j . We get that σ satisfies constraints S i (x p ) and S e (x p )
for some p , 1 p n i + 1 , proving the induction step.
S e (u l ) for some l , 1
l
Algorithm 4 generates top-K matchings in G.
Theorem 5.9
Proof. The correctness of the theorem follows by Lemma 5.6 and Lemma 5.8 .
Finally, we analyze the complexity of Algorithm 4 . The algorithm performs k iterations.
Each iteration requires invocation of the algorithm A best O( | V | ) times and finding the maximum
matching on the tree frontier in O(log(k)) . We conclude that the overall complexity of the algorithm
is O(klog(k) + k | V | C(A best )) .
5.3
EXTENDING TOP- K IDENTIFICATION TO ENSEMBLES
In this section, we present several algorithms for connecting the ensemble approach of Chapter 4
and the top- K approach, increasing the robustness of the schema matching process by enjoying the
best of both worlds. Domshlak et al. [ 2007 ] introduced a generic computational framework, schema
metamatching , for computing the top- K prefix of a “consensus” ranking of alternative matchings be-
tween two schemata, given the graded valid schema matchings provided individually by the members
of an ensemble. Their starting point is based on rank aggregation techniques developed in the areas
of Web search and database middleware [ Dwork et al. , 2001 , Fagin et al. , 2003 ]. They show that
the Threshold algorithm, originally proposed in the context of database middleware by Fagin et al.
[ 2003 ], can be applied to identify a consensus top- K schema matchings almost as is. Unfortunately,
computing top- K schema matchings using the Threshold algorithm may require time exponential
in the size of the matched schemata. Therefore, techniques that exploit the specifics of the schema
matching problem were developed to create a simple algorithm, the Matrix-Direct algorithm, whose
time complexity is polynomial in the size of the matched schemata and the required K . The Matrix-
Direct algorithm can be applied for a certain broad class of problems, and it was extended to the
Matrix-Direct-with-Bounding algorithm, which draws upon both the Matrix-Direct and Thresh-
old algorithms to address matching scenarios where the Matrix-Direct algorithm is inapplicable.
The Threshold and Matrix-Direct-with-Bounding algorithms were shown to be (complexity-wise)
mutually undominated—that is, there exist problem instances in which one algorithm performs dra-
matically better than the other. This was resolved by introduction of the CrossThreshold algorithm,
a hybrid version of these two algorithms, based on their in-parallel, mutually-enhancing execution.
Search WWH ::




Custom Search