Top-K Schema Matchings - Uncertain Schema Matching

Databases Reference

In-Depth Information

leaf u j and adding its children, constructed by the algorithm at iteration i

1. By the induction

hypothesis, any matching different from σ 1 ,σ 2 ..., σ i

must satisfy constraints implied by S i (u l ) and

n i . Therefore, any matching σ different from σ 1 ,σ 2 ..., σ i must satisfy

one of the latter constraints. In addition, if σ satisfies constraints implied by S i (u j ) and S e (u j ) ,

which correspond to the leaf expanded at the i -th iteration, then by Lemma 5.7 it must also satisfy

constraints implied by one of the children of u j . We get that σ satisfies constraints S i (x p ) and S e (x p )

for some p , 1 ≤ p ≤ n i + 1 , proving the induction step.

S e (u l ) for some l , 1

≤

Algorithm 4 generates top-K matchings in G.

Theorem 5.9

Proof. The correctness of the theorem follows by Lemma 5.6 and Lemma 5.8 .

Finally, we analyze the complexity of Algorithm 4 . The algorithm performs k iterations.

Each iteration requires invocation of the algorithm A best O( | V | ) times and finding the maximum

matching on the tree frontier in O(log(k)) . We conclude that the overall complexity of the algorithm

is O(klog(k) + k | V | C(A best )) .

5.3

EXTENDING TOP- K IDENTIFICATION TO ENSEMBLES

In this section, we present several algorithms for connecting the ensemble approach of Chapter 4

and the top- K approach, increasing the robustness of the schema matching process by enjoying the

best of both worlds. Domshlak et al. [ 2007 ] introduced a generic computational framework, schema

metamatching , for computing the top- K prefix of a “consensus” ranking of alternative matchings be-

tween two schemata, given the graded valid schema matchings provided individually by the members

of an ensemble. Their starting point is based on rank aggregation techniques developed in the areas

of Web search and database middleware [ Dwork et al. , 2001 , Fagin et al. , 2003 ]. They show that

the Threshold algorithm, originally proposed in the context of database middleware by Fagin et al.

[ 2003 ], can be applied to identify a consensus top- K schema matchings almost as is. Unfortunately,

computing top- K schema matchings using the Threshold algorithm may require time exponential

in the size of the matched schemata. Therefore, techniques that exploit the specifics of the schema

matching problem were developed to create a simple algorithm, the Matrix-Direct algorithm, whose

time complexity is polynomial in the size of the matched schemata and the required K . The Matrix-

Direct algorithm can be applied for a certain broad class of problems, and it was extended to the

Matrix-Direct-with-Bounding algorithm, which draws upon both the Matrix-Direct and Thresh-

old algorithms to address matching scenarios where the Matrix-Direct algorithm is inapplicable.

The Threshold and Matrix-Direct-with-Bounding algorithms were shown to be (complexity-wise)

mutually undominated—that is, there exist problem instances in which one algorithm performs dra-

matically better than the other. This was resolved by introduction of the CrossThreshold algorithm,

a hybrid version of these two algorithms, based on their in-parallel, mutually-enhancing execution.

Search WWH ::

Custom Search

Home