3.2.2 Parameters at the Matcher Level
The second category of matchers uses machine learning techniques to combine similarity measures. However, these matchers share almost the same parameters as the first category.
SMB [Marie and Gal 2008] is based on the Boosting algorithm. In addition
to training data, this approach needs two parameters. The first is a hypothesis space, in this case a pair of similarity measures chosen from a pool. It turns out that the similarity measures which perform well when used alone are mostly not retained in the hypothesis space when combined with another measure. The second is an error measure, which serves both to stop the algorithm (when the computed error value reaches a threshold, 0.5 by default) and to select, at each iteration, the similarity measure that produced the fewest errors. The authors noticed that this error value is reached quickly, and therefore added a pre-processing step that removes all pairs of schema elements classified as irrelevant by all classifiers.
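The sketch below illustrates the general idea of boosting similarity measures. It is not SMB itself: the actual system uses pairs of a similarity measure and a decision maker as weak classifiers and draws on a larger pool, whereas this toy version (with two hypothetical measures) only shows the per-iteration selection of the measure with the lowest weighted error and the 0.5 stopping threshold mentioned above.

```python
import math

# Two hypothetical similarity measures standing in for SMB's pool.
def levenshtein_sim(a, b):
    """Normalized edit-distance similarity in [0, 1]."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    prev = list(range(lb + 1))
    for i in range(1, la + 1):
        cur = [i] + [0] * lb
        for j in range(1, lb + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return 1.0 - prev[lb] / max(la, lb)

def prefix_sim(a, b):
    """1 if one label is a prefix of the other, else 0."""
    a, b = a.lower(), b.lower()
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0

MEASURES = {"levenshtein": levenshtein_sim, "prefix": prefix_sim}

def boost(pairs, labels, rounds=10, error_threshold=0.5):
    """AdaBoost-style combination: each round keeps the measure with the
    lowest weighted error and stops once that error reaches the threshold."""
    n = len(pairs)
    weights = [1.0 / n] * n
    ensemble = []                                  # (measure name, vote weight)
    for _ in range(rounds):
        best_name, best_err, best_preds = None, 1.0, None
        for name, sim in MEASURES.items():
            preds = [sim(a, b) >= 0.5 for a, b in pairs]   # weak "match" decision
            err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
            if err < best_err:
                best_name, best_err, best_preds = name, err, preds
        if best_err >= error_threshold:            # stopping rule from the text
            break
        alpha = 0.5 * math.log((1 - best_err) / max(best_err, 1e-9))
        ensemble.append((best_name, alpha))
        if best_err == 0:                          # perfect classifier: nothing to re-weight
            break
        # Re-weight the training pairs to emphasize the misclassified ones.
        weights = [w * math.exp(-alpha if p == y else alpha)
                   for w, p, y in zip(weights, best_preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

pairs = [("client_name", "client_names"), ("zip", "unit_price")]
labels = [True, False]
print(boost(pairs, labels))     # e.g. [('levenshtein', 10.36...)]
```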
In YAM [Duchateau et al. 2009a, b], the amount of training data, extracted from a knowledge base, is either provided by users or chosen according to empirical evaluation results. This tool can also be trained with similar schemas: users may already have schemas that have been matched and can be reused to improve the results. The authors indicate that such schemas should either belong to the same domain (e.g., biology, business) or share some features (e.g., degree of heterogeneity, nested structure).
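As a rough illustration of this kind of training (not YAM's actual implementation, which selects among a library of machine learning classifiers), the sketch below turns previously matched element pairs into vectors of similarity scores and fits a standard decision tree with scikit-learn. The feature functions and training pairs are hypothetical.

```python
from difflib import SequenceMatcher
from sklearn.tree import DecisionTreeClassifier   # assumed available

def features(a, b):
    """Vector of similarity scores for a pair of element labels."""
    a, b = a.lower(), b.lower()
    return [
        SequenceMatcher(None, a, b).ratio(),                    # string similarity
        1.0 if a.startswith(b) or b.startswith(a) else 0.0,     # prefix match
        len(set(a) & set(b)) / len(set(a) | set(b)),            # character Jaccard
    ]

# Hypothetical element pairs from schemas matched in the past
# (same domain, or similar degree of heterogeneity, as discussed above).
training = [
    ("customer_name", "clientName", 1),
    ("zip_code",      "postalCode", 1),
    ("order_date",    "productId",  0),
    ("unit_price",    "lastName",   0),
]

X = [features(a, b) for a, b, _ in training]
y = [label for _, _, label in training]
classifier = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Decide whether a new pair of elements from unmatched schemas corresponds.
print(classifier.predict([features("cust_name", "customerName")]))
```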
3.3 External Resources
External resources have long been useful for bringing reliable knowledge into the schema matching process. Besides availability and security issues, users should check that the content of a resource is adequate for the given matching task and that it can be integrated within the matcher. Schema matchers accept different types of resources. The simplest one is a list of similar labels, also called a list of synonyms. COMA++ [Aumueller et al. 2005] and Porsche [Saleem et al. 2008] let users fill in these resources. Lists of abbreviations are also commonly used to expand the labels of ambiguous schema elements, for instance in COMA++ [Aumueller et al. 2005].
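A minimal sketch of how such lists can be exploited, assuming user-provided synonym and abbreviation tables in the spirit of those of COMA++; the entries below are illustrative and not taken from any tool.

```python
# Illustrative user-provided lists; not the content of any actual tool.
ABBREVIATIONS = {"cust": "customer", "no": "number", "addr": "address"}
SYNONYMS = {frozenset({"client", "customer"}), frozenset({"zip", "postal_code"})}

def expand(label):
    """Expand known abbreviations in an underscore- or dash-separated label."""
    tokens = label.lower().replace("-", "_").split("_")
    return "_".join(ABBREVIATIONS.get(t, t) for t in tokens)

def synonym_match(a, b):
    """True if the expanded labels are identical or listed as synonyms."""
    ea, eb = expand(a), expand(b)
    return ea == eb or frozenset({ea, eb}) in SYNONYMS

print(synonym_match("Cust_No", "customer_number"))   # True, via abbreviations
print(synonym_match("client", "Customer"))           # True, via the synonym list
```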
Another type of external resource is the domain ontology, used for instance by QuickMig [Drumm et al. 2007]. Similarly, Porsche [Saleem and Bellahsene 2009] is enhanced with data mining techniques applied to many domain ontologies to extract mini-taxonomies, which are then used to discover complex mappings.
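The sketch below only illustrates how a domain taxonomy can be consulted during matching: two labels are considered related if they name the same concept, one is an ancestor of the other, or they share an ancestor. The tiny taxonomy is hypothetical, and neither QuickMig's migration-oriented use of ontologies nor Porsche's mini-taxonomy mining is reproduced here.

```python
# A tiny, hypothetical domain taxonomy: child concept -> parent concept.
PARENT = {
    "invoice": "document",
    "purchase_order": "document",
    "customer": "business_partner",
    "supplier": "business_partner",
}

def ancestors(concept):
    """All ancestors of a concept, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def related(a, b):
    """True if two labels name the same or related concepts in the taxonomy."""
    a, b = a.lower(), b.lower()
    return a == b or bool(set([a] + ancestors(a)) & set([b] + ancestors(b)))

print(related("customer", "supplier"))   # True: both fall under business_partner
print(related("invoice", "customer"))    # False
```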
The WordNet dictionary [Wordnet 2007] is also used in different ways: it facilitates the discovery of various relationships (e.g., synonyms, antonyms) in approaches such as YAM [Duchateau et al. 2009a, b] and S-MATCH/S-MATCH++ [Giunchiglia et al. 2004; Avesani et al. 2005]. A dictionary can also become the core of the system against which all schema elements are matched, as performed by AUTOPLEX/AUTOMATCH [Berlin and Motro 2001, 2002].
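As an illustration of this kind of lookup, the sketch below uses NLTK's WordNet interface (it requires downloading the WordNet corpus beforehand); YAM and S-MATCH integrate WordNet in their own ways, so this only shows how synonym and antonym relationships between two labels can be detected.

```python
# Requires the WordNet corpus: python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def wordnet_relation(a, b):
    """Return 'synonym', 'antonym', or None for two element labels."""
    synsets_a = wn.synsets(a)
    synsets_b = set(wn.synsets(b))
    if synsets_b & set(synsets_a):
        return "synonym"                      # the labels share a synset
    for synset in synsets_a:
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                if antonym.synset() in synsets_b:
                    return "antonym"
    return None

print(wordnet_relation("client", "customer"))   # synonym
print(wordnet_relation("buy", "sell"))          # antonym
```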