3.2.2 Parameters at the Matcher Level
The second category of matchers uses machine learning techniques to combine similarity measures. However, these matchers share almost the same parameters as the first category.
SMB [Marie and Gal 2008] is based on the Boosting algorithm. In addition
to training data, this approach needs two parameters. The first is a hypothesis space, in this case a pair of similarity measures chosen from a pool. It turns out that the similarity measures which perform well when used alone are mostly not retained in the hypothesis space when combined with another measure. The second is an error measure, which serves both to stop the algorithm (when the computed error value reaches a threshold, 0.5 by default) and to select, at each iteration, the similarity measure that produced the fewest errors. The authors noticed that this error value is reached quickly, and therefore added a pre-processing step that removes all pairs of schema elements classified as irrelevant by all classifiers.
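The sketch below illustrates the general idea of boosting similarity measures. It is not SMB itself: the actual system uses pairs of a similarity measure and a decision maker as weak classifiers and draws on a larger pool, whereas this toy version (with two hypothetical measures) only shows the per-iteration selection of the measure with the lowest weighted error and the 0.5 stopping threshold mentioned above.

```python
import math

# Two hypothetical similarity measures standing in for SMB's pool.
def levenshtein_sim(a, b):
    """Normalized edit-distance similarity in [0, 1]."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    prev = list(range(lb + 1))
    for i in range(1, la + 1):
        cur = [i] + [0] * lb
        for j in range(1, lb + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return 1.0 - prev[lb] / max(la, lb)

def prefix_sim(a, b):
    """1 if one label is a prefix of the other, else 0."""
    a, b = a.lower(), b.lower()
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0

MEASURES = {"levenshtein": levenshtein_sim, "prefix": prefix_sim}

def boost(pairs, labels, rounds=10, error_threshold=0.5):
    """AdaBoost-style combination: each round keeps the measure with the
    lowest weighted error and stops once that error reaches the threshold."""
    n = len(pairs)
    weights = [1.0 / n] * n
    ensemble = []                                  # (measure name, vote weight)
    for _ in range(rounds):
        best_name, best_err, best_preds = None, 1.0, None
        for name, sim in MEASURES.items():
            preds = [sim(a, b) >= 0.5 for a, b in pairs]   # weak "match" decision
            err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
            if err < best_err:
                best_name, best_err, best_preds = name, err, preds
        if best_err >= error_threshold:            # stopping rule from the text
            break
        alpha = 0.5 * math.log((1 - best_err) / max(best_err, 1e-9))
        ensemble.append((best_name, alpha))
        if best_err == 0:                          # perfect classifier: nothing to re-weight
            break
        # Re-weight the training pairs to emphasize the misclassified ones.
        weights = [w * math.exp(-alpha if p == y else alpha)
                   for w, p, y in zip(weights, best_preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

pairs = [("client_name", "client_names"), ("zip", "unit_price")]
labels = [True, False]
print(boost(pairs, labels))     # e.g. [('levenshtein', 10.36...)]
```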
In YAM [Duchateau et al. 2009a, b], the amount of training data, extracted from a knowledge base, is either provided by users or chosen according to empirical evaluation results. This tool can also be trained with similar schemas: users may already have schemas that have been matched and can be reused to improve the results. The authors indicate that such schemas should either belong to the same domain (e.g., biology, business) or share some features (e.g., degree of heterogeneity, nested structure).
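As a rough illustration of this kind of training (not YAM's actual implementation, which selects among a library of machine learning classifiers), the sketch below turns previously matched element pairs into vectors of similarity scores and fits a standard decision tree with scikit-learn. The feature functions and training pairs are hypothetical.

```python
from difflib import SequenceMatcher
from sklearn.tree import DecisionTreeClassifier   # assumed available

def features(a, b):
    """Vector of similarity scores for a pair of element labels."""
    a, b = a.lower(), b.lower()
    return [
        SequenceMatcher(None, a, b).ratio(),                    # string similarity
        1.0 if a.startswith(b) or b.startswith(a) else 0.0,     # prefix match
        len(set(a) & set(b)) / len(set(a) | set(b)),            # character Jaccard
    ]

# Hypothetical element pairs from schemas matched in the past
# (same domain, or similar degree of heterogeneity, as discussed above).
training = [
    ("customer_name", "clientName", 1),
    ("zip_code",      "postalCode", 1),
    ("order_date",    "productId",  0),
    ("unit_price",    "lastName",   0),
]

X = [features(a, b) for a, b, _ in training]
y = [label for _, _, label in training]
classifier = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Decide whether a new pair of elements from unmatched schemas corresponds.
print(classifier.predict([features("cust_name", "customerName")]))
```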
3.3 External Resources
External resources have long been useful for bringing reliable knowledge into the schema matching process. Besides availability and security issues, users should check that the content of a resource is adequate for the given matching task and that it can be integrated within the matcher. Schema matchers accept different types of resources. The simplest one is a list of similar labels, also called a list of synonyms. COMA++ [Aumueller et al. 2005] and Porsche [Saleem et al. 2008] let users fill in these resources. Lists of abbreviations are also commonly used to expand the labels of ambiguous schema elements, for instance in COMA++ [Aumueller et al. 2005].
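A minimal sketch of how such lists can be exploited, assuming user-provided synonym and abbreviation tables in the spirit of those of COMA++; the entries below are illustrative and not taken from any tool.

```python
# Illustrative user-provided lists; not the content of any actual tool.
ABBREVIATIONS = {"cust": "customer", "no": "number", "addr": "address"}
SYNONYMS = {frozenset({"client", "customer"}), frozenset({"zip", "postal_code"})}

def expand(label):
    """Expand known abbreviations in an underscore- or dash-separated label."""
    tokens = label.lower().replace("-", "_").split("_")
    return "_".join(ABBREVIATIONS.get(t, t) for t in tokens)

def synonym_match(a, b):
    """True if the expanded labels are identical or listed as synonyms."""
    ea, eb = expand(a), expand(b)
    return ea == eb or frozenset({ea, eb}) in SYNONYMS

print(synonym_match("Cust_No", "customer_number"))   # True, via abbreviations
print(synonym_match("client", "Customer"))           # True, via the synonym list
```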
Another type of external resource is the domain ontology, used for instance by QuickMig [Drumm et al. 2007]. Similarly, Porsche [Saleem and Bellahsene 2009] is enhanced with data mining techniques applied to many domain ontologies to extract mini-taxonomies, which are then used to discover complex mappings.
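The sketch below only illustrates how a domain taxonomy can be consulted during matching: two labels are considered related if they name the same concept, one is an ancestor of the other, or they share an ancestor. The tiny taxonomy is hypothetical, and neither QuickMig's migration-oriented use of ontologies nor Porsche's mini-taxonomy mining is reproduced here.

```python
# A tiny, hypothetical domain taxonomy: child concept -> parent concept.
PARENT = {
    "invoice": "document",
    "purchase_order": "document",
    "customer": "business_partner",
    "supplier": "business_partner",
}

def ancestors(concept):
    """All ancestors of a concept, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def related(a, b):
    """True if two labels name the same or related concepts in the taxonomy."""
    a, b = a.lower(), b.lower()
    return a == b or bool(set([a] + ancestors(a)) & set([b] + ancestors(b)))

print(related("customer", "supplier"))   # True: both fall under business_partner
print(related("invoice", "customer"))    # False
```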
The WordNet dictionary [Wordnet 2007] is also used in different ways: it facilitates the discovery of various relationships (e.g., synonyms, antonyms) in approaches such as YAM [Duchateau et al. 2009a, b] and S-MATCH/S-MATCH++ [Giunchiglia et al. 2004; Avesani et al. 2005]. A dictionary can also become the core of the system against which all schema elements are matched, as performed by AUTOPLEX/AUTOMATCH [Berlin and Motro 2001, 2002].
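As an illustration of this kind of lookup, the sketch below uses NLTK's WordNet interface (it requires downloading the WordNet corpus beforehand); YAM and S-MATCH integrate WordNet in their own ways, so this only shows how synonym and antonym relationships between two labels can be detected.

```python
# Requires the WordNet corpus: python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def wordnet_relation(a, b):
    """Return 'synonym', 'antonym', or None for two element labels."""
    synsets_a = wn.synsets(a)
    synsets_b = set(wn.synsets(b))
    if synsets_b & set(synsets_a):
        return "synonym"                      # the labels share a synset
    for synset in synsets_a:
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                if antonym.synset() in synsets_b:
                    return "antonym"
    return None

print(wordnet_relation("client", "customer"))   # synonym
print(wordnet_relation("buy", "sell"))          # antonym
```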