correspondence that belongs to the exact matching is considered to be correct, even if the com-
plex matching is not fully captured. This method aims at compensating the matchers for the 1 : 1
cardinality enforcement.
The experiment provides a comparative analysis of the performance of the 2LNB heuristic
with four heuristics that enforce a matching cardinality of 1 : 1. Figure 4.1 illustrates the results.
The x axis represents the four different datasets, with precision on the y axis in Figure 4.1 (top) and
recall in Figure 4.1 (bottom).
In terms of precision, the 2LNB heuristic outperforms all other heuristics. For the real dataset,
this improvement in precision comes at the cost of recall. This disadvantage disappears in the sim-
ulated data, where the 2LNB heuristic dominates other heuristics, even for the simulated data with
n = 1, where the 1 : 1 cardinality constraint holds (although it is not enforced for the proposed heuristic).
For this case, the composition heuristic comes in very close behind.
Two observations in particular may explain this behavior. First, the naïve assumption of inde-
pendence does not hold in this set of experiments, since OntoBuilder heuristics are all heavily based
on syntactic comparisons. Second, it is possible that the training dataset used to determine the beta
distributions does not serve as a good estimator for the matchers' decision making. The latter can be
improved using statistical methods for outlier elimination. For the former, a method for ensemble
matcher selection is needed. Such a method is discussed next.
4.3 CONSTRUCTING ENSEMBLES
Choosing among schema matchers is far from trivial. First, the number of schema matchers is
continuously growing, and this diversity by itself complicates the choice of the most appropriate tool
for a given application domain. Second, as one would expect, empirical analysis shows that there is
not (and may never be) a single dominant schema matcher that performs best, regardless of the data
model and application domain [ Gal et al. , 2005a ].
Most research work devoted to constructing ensembles deals with setting the relative impact
of each participating matcher. For example, consider Meta-Learner [ Doan et al. , 2001 ] and On-
toBuilder [ Marie and Gal , 2008 ]. In both tools, a weighted average of the decisions taken by the
matchers in an ensemble determines the matching outcome. Doan et al. [2001] set the weights
using a least-squares linear regression analysis, while Marie and Gal [2008] use the boosting mech-
anism (to be described shortly). The literature shows a connection between boosting and logistic
regression [ Schapire , 2001 ], yet there is no evident connection to linear regression.
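The weighted-average combination used by both tools can be sketched in a few lines. The matcher names, scores, and weights below are hypothetical illustrations, not the actual output of Meta-Learner or OntoBuilder; in Doan et al.'s approach the weights would be fit by least-squares regression against known correct correspondences, whereas here they are simply assumed.

```python
# A minimal sketch of a weighted-average matcher ensemble.
# Each matcher returns a similarity matrix over the same candidate/target
# attribute pairs (rows: candidate attributes, cols: target attributes).
matcher_scores = {
    "term_matcher":  [[0.9, 0.1], [0.2, 0.8]],  # hypothetical scores
    "value_matcher": [[0.7, 0.3], [0.4, 0.6]],
}
weights = {"term_matcher": 0.6, "value_matcher": 0.4}  # assumed weights

def ensemble_similarity(scores, weights):
    """Weighted average of the matchers' similarity matrices."""
    total = sum(weights.values())
    rows = len(next(iter(scores.values())))
    cols = len(next(iter(scores.values()))[0])
    combined = [[0.0] * cols for _ in range(rows)]
    for name, matrix in scores.items():
        w = weights[name] / total  # normalize weights to sum to one
        for i in range(rows):
            for j in range(cols):
                combined[i][j] += w * matrix[i][j]
    return combined

combined = ensemble_similarity(matcher_scores, weights)
```

The combined matrix then feeds whatever matching-selection step the tool applies; here both diagonal entries dominate their rows, so a simple per-row argmax would recover the intended correspondences.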
Research has shown that many schema matchers perform better than random choice. We
argue that any (statistically) monotonic matcher is a weak classifier [ Schapire , 1990 ]—a classifier
that is only slightly correlated with the true classification. A weak classifier for binary classification
problems is any algorithm that achieves a weighted empirical error on the training set which is
bounded from above by 1/2 − γ, γ > 0, for some distribution on the dataset (the dataset consists of
weighted examples that sum to unity). In other words, it can produce a hypothesis that performs at
least slightly better than random choice. The theory of weak classifiers has led to the introduction of
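The weak-classifier condition can be checked directly. The sketch below is illustrative: the function names and the four toy correspondences are invented for this example, not taken from any of the cited systems.

```python
# A sketch of the weak-classifier condition: weighted empirical error
# bounded above by 1/2 - gamma, for some gamma > 0.

def weighted_error(predictions, labels, weights):
    """Total weight of misclassified examples (weights sum to one)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w for p, y, w in zip(predictions, labels, weights) if p != y)

def is_weak_classifier(predictions, labels, weights, gamma):
    """True if the weighted error is at most 1/2 - gamma, i.e. the
    classifier beats random choice by a margin of at least gamma."""
    return weighted_error(predictions, labels, weights) <= 0.5 - gamma

# A matcher viewed as a binary classifier over four candidate
# correspondences (1 = correct correspondence). It errs once under
# uniform example weights, so its weighted error is 0.25.
labels      = [1, 0, 1, 1]
predictions = [1, 0, 0, 1]
weights     = [0.25, 0.25, 0.25, 0.25]
```

With these values the matcher is a weak classifier for any γ up to 0.25; boosting exploits exactly this margin by reweighting the examples the classifier gets wrong.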