Databases Reference
In-Depth Information
CHAPTER
5
Top-
K
Schema Matchings
The quest for certainty blocks the search for meaning.
-ErichFromm
Top-
K
schema matchings are intuitively defined as a ranked list of the best
K
schema match-
ings a matcher can generate. The formal definition, given in Section
5.1
, is recursive, providing
interesting insights into the behavior patterns of matchers. Top-
K
schema matchings play a pivotal
role in managing uncertain schema matching. The effectively unlimited heterogeneity and ambigu-
ity of data description suggests that in many cases an exact matching will not be identified as a best
matching by any schema matcher. Therefore, top-
K
schema matchings can be used to create a search
space in uncertain settings [
Anaby-Tavor
,
2003
,
Bernstein et al.
,
2006
] and can serve in assigning
probabilities in probabilistic schema matchings [
Roitman et al.
,
2008
]. They can also help improve
the precision of matching results [
Gal
,
2006
]. All in all, empirical results have shown the top-
K
list
to be a quality set of candidates for schema matching.
We start this chapter with a formal definition of top-
K
schema matchings (Section
5.1
).
Section
5.2
provides three efficient algorithms for finding top-
K
schema matchings. We connect
top-
K
schema matching with ensemble decision making in Section
5.3
and detail a heuristic that
uses top-
K
schema matchings to improve on matcher performance in Section
5.4
. We conclude
with a method for using top-
K
schema matchings to assign probabilities to probabilistic attribute
correspondences (Section
5.5
).
5.1
TOP-
K
SCHEMA MATCHINGS: DEFINITION
We now provide a formal definition of top-
K
schema matching using a bipartite graph representation.
For simplicity, we restrict the discussion to 1
1 matchings. Extensions can be made by using
hypergraphs, in which edges connect subsets of nodes rather than individual nodes.
Let
G
:
(X,Y,E)
be an undirected bipartite graph with nodes representing attributes of
two schemata and edges representing the degree of similarity between attributes. Assume a problem
instance with a positive weight function
=
(
0
,
1] defined on edges. Given a schema matcher
and a similarity matrix
M
,
(i, j)
=
M
i,j
. It is worth noting that
G
contains no edges with 0
weight. Therefore, whenever
M
i,j
:
E
→
0, there is no edge between attributes
i
and
j
in
G
. A matching
σ
is a subset of
G
's edges,
σ
⊆
E
.Therefore,
σ
⊆
E
is equivalent to
σ
∈
.The weight of a matching
σ
is
f(σ,M)
, as defined in Section
4.1
. Given a constraint specification
(see Section
3.1.3
), we
consider hereafter only valid schema matchings in
.
=