correspondence that belongs to the exact matching is considered to be correct, even if the com-
plex matching is not fully captured. This method aims at compensating the matchers for the 1 : 1
cardinality enforcement.
The experiment provides a comparative analysis of the performance of the 2LNB heuristic
with four heuristics that enforce a matching cardinality of 1 : 1. Figure 4.1 illustrates the results.
The x axis represents the four different datasets, with precision on the y axis in Figure 4.1 (top) and
recall in Figure 4.1 (bottom).
In terms of precision, the 2LNB heuristic outperforms all other heuristics. For the real dataset,
this improvement in precision comes at the cost of recall. This disadvantage disappears in the sim-
ulated data, where the 2LNB heuristic dominates other heuristics, even for the simulated data with
n = 1, where the 1 : 1 cardinality constraint holds (although it is not enforced for the proposed heuristic).
For this case, the composition heuristic comes in very close behind.
Two observations in particular may explain this behavior. First, the naïve assumption of inde-
pendence does not hold in this set of experiments, since OntoBuilder heuristics are all heavily based
on syntactic comparisons. Second, it is possible that the training dataset used to determine the beta
distributions does not serve as a good estimator for the matchers' decision making. The latter can be
improved using statistical methods for outlier elimination. For the former, a method for ensemble
matcher selection is needed. Such a method is discussed next.
4.3 CONSTRUCTING ENSEMBLES
Choosing among schema matchers is far from trivial. First, the number of schema matchers is
continuously growing, and this diversity by itself complicates the choice of the most appropriate tool
for a given application domain. Second, as one would expect, empirical analysis shows that there is
not (and may never be) a single dominant schema matcher that performs best, regardless of the data
model and application domain [ Gal et al. , 2005a ].
Most research work devoted to constructing ensembles deals with setting the relative impact
of each participating matcher. For example, consider Meta-Learner [ Doan et al. , 2001 ] and On-
toBuilder [ Marie and Gal , 2008 ]. In both tools, a weighted average of the decisions taken by the
matchers in an ensemble determines the matching outcome. Doan et al. [2001] set the weights
using a least-squares linear regression analysis, while Marie and Gal [2008] use the boosting mech-
anism (to be described shortly). The literature shows a connection between boosting and logistic
regression [ Schapire , 2001 ], yet there is no evident connection to linear regression.
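The weighted-average combination used by both tools can be sketched in a few lines. The matcher names, scores, and weights below are hypothetical illustrations, not the actual output of Meta-Learner or OntoBuilder; in Doan et al.'s approach the weights would be fit by least-squares regression against known correct correspondences, whereas here they are simply assumed.

```python
# A minimal sketch of a weighted-average matcher ensemble.
# Each matcher returns a similarity matrix over the same candidate/target
# attribute pairs (rows: candidate attributes, cols: target attributes).
matcher_scores = {
    "term_matcher":  [[0.9, 0.1], [0.2, 0.8]],  # hypothetical scores
    "value_matcher": [[0.7, 0.3], [0.4, 0.6]],
}
weights = {"term_matcher": 0.6, "value_matcher": 0.4}  # assumed weights

def ensemble_similarity(scores, weights):
    """Weighted average of the matchers' similarity matrices."""
    total = sum(weights.values())
    rows = len(next(iter(scores.values())))
    cols = len(next(iter(scores.values()))[0])
    combined = [[0.0] * cols for _ in range(rows)]
    for name, matrix in scores.items():
        w = weights[name] / total  # normalize weights to sum to one
        for i in range(rows):
            for j in range(cols):
                combined[i][j] += w * matrix[i][j]
    return combined

combined = ensemble_similarity(matcher_scores, weights)
```

The combined matrix then feeds whatever matching-selection step the tool applies; here both diagonal entries dominate their rows, so a simple per-row argmax would recover the intended correspondences.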
Research has shown that many schema matchers perform better than random choice. We
argue that any (statistically) monotonic matcher is a weak classifier [ Schapire , 1990 ]—a classifier
that is only slightly correlated with the true classification. A weak classifier for binary classification
problems is any algorithm that achieves a weighted empirical error on the training set which is
bounded from above by 1/2 − γ, γ > 0, for some distribution on the dataset (the dataset consists of
weighted examples that sum to unity). In other words, it can produce a hypothesis that performs at
least slightly better than random choice. The theory of weak classifiers has led to the introduction of
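The weak-classifier condition can be checked directly. The sketch below is illustrative: the function names and the four toy correspondences are invented for this example, not taken from any of the cited systems.

```python
# A sketch of the weak-classifier condition: weighted empirical error
# bounded above by 1/2 - gamma, for some gamma > 0.

def weighted_error(predictions, labels, weights):
    """Total weight of misclassified examples (weights sum to one)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w for p, y, w in zip(predictions, labels, weights) if p != y)

def is_weak_classifier(predictions, labels, weights, gamma):
    """True if the weighted error is at most 1/2 - gamma, i.e. the
    classifier beats random choice by a margin of at least gamma."""
    return weighted_error(predictions, labels, weights) <= 0.5 - gamma

# A matcher viewed as a binary classifier over four candidate
# correspondences (1 = correct correspondence). It errs once under
# uniform example weights, so its weighted error is 0.25.
labels      = [1, 0, 1, 1]
predictions = [1, 0, 0, 1]
weights     = [0.25, 0.25, 0.25, 0.25]
```

With these values the matcher is a weak classifier for any γ up to 0.25; boosting exploits exactly this margin by reweighting the examples the classifier gets wrong.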