6.3 YAM
Yet Another Matcher (YAM) [Duchateau et al. 2009a, b] enables the generation of à la carte schema matchers according to user requirements. It uses a knowledge base that includes a (possibly large) set of similarity measures and machine learning classifiers. All classifiers are trained with scenarios from this knowledge base (and, optionally, scenarios provided by the users). Their individual results (precision, recall, and F-measure) are computed, and according to the adopted strategy, the classifier that achieves the best quality is selected as the schema matcher. The strategy mainly depends on user input. For instance, if the user wants to promote recall, then the classifier with the best recall value is returned. If the user has provided expert mappings, then YAM selects as the schema matcher the classifier that obtains the best F-measure on this set of expert mappings.
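A minimal sketch of this selection step is given below, assuming a generic scikit-learn setup; the candidate classifiers, feature layout, and the function select_matcher are illustrative choices, not YAM's actual implementation or API.

```python
# Hypothetical sketch of YAM-style matcher selection: train several candidate
# classifiers on knowledge-base scenarios, then keep the one that maximises
# the metric chosen by the user's strategy (precision, recall, or F-measure).
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

def select_matcher(X_train, y_train, X_expert, y_expert, strategy="f1"):
    """Return the candidate classifier with the best score on the expert mappings."""
    candidates = [DecisionTreeClassifier(),
                  GaussianNB(),
                  LogisticRegression(max_iter=1000)]
    metric = {"precision": precision_score,
              "recall": recall_score,
              "f1": f1_score}[strategy]
    scored = []
    for clf in candidates:
        clf.fit(X_train, y_train)                        # train on scenarios
        score = metric(y_expert, clf.predict(X_expert))  # evaluate on expert mappings
        scored.append((score, clf))
    return max(scored, key=lambda s: s[0])[1]            # best classifier = matcher
```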
6.4 SMB
In Marie and Gal [2008], the authors propose a machine learning approach, SMB. It uses the Boosting algorithm to classify the similarity measures, which are divided into first-line and second-line matchers. The Boosting algorithm consists of iterating weak classifiers over the training set while re-adjusting the importance of the elements in this training set. Thus, SMB automatically selects a pair of similarity measures as a matcher by focusing on the harder training data. An advantage of this algorithm is the greater weight given to misclassified pairs during training. Although this approach makes use of several similarity measures, it mainly combines a similarity measure (first-line matcher) with a decision maker (second-line matcher). Empirical results show that the selection of the pair does not depend on their individual performance.
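The sketch below illustrates the boosting idea underlying SMB, under the assumption that each column of X holds the scores of one similarity measure for a candidate element pair; the feature layout and the function boost_matchers are hypothetical and do not reproduce SMB's implementation.

```python
# AdaBoost-style loop: one-feature decision stumps act as weak matchers,
# and misclassified pairs receive a larger weight at every round, so later
# rounds concentrate on the harder training data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_matchers(X, y, rounds=10):
    """y must be in {-1, +1}; returns a list of (stump, alpha) weak matchers."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # weak learner on weighted pairs
        pred = stump.predict(X)
        err = np.dot(w, pred != y) / w.sum()
        if err >= 0.5:                       # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)       # misclassified pairs get heavier
        w /= w.sum()
        ensemble.append((stump, alpha))
    return ensemble
```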
6.5 STEM
In a broader way, the STEM framework [Köpcke and Rahm 2008] identifies the most interesting training data set, which is then used to combine matching strategies and to tune several parameters such as thresholds. First, training data is generated, either manually (i.e., an expert labels the entity pairs) or automatically (at random, using static-active selection, or using active learning). Then, similarity values are computed for the pairs in the training data set to build a similarity matrix between each pair and each similarity measure. Finally, the matching strategy is deduced from this matrix by means of a supervised learning algorithm. The output is a tuned matching strategy (i.e., how to combine similarity measures and tune their parameters). The framework enables a comparative study of various similarity measures (e.g., Trigrams, Jaccard) combined with different strategies (e.g., decision tree, linear regression) whose parameters are either manually or automatically tuned.
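The following sketch mirrors this pipeline under simplified assumptions: labelled entity pairs are turned into a similarity matrix with one column per similarity measure, and a supervised learner then derives the combined matching strategy. The measure implementations, the toy training pairs, and the helper similarity_matrix are illustrative, not part of STEM.

```python
# Hypothetical STEM-like pipeline: similarity matrix + supervised combination.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def trigram_sim(a, b):
    """Jaccard overlap of character 3-grams (a simple Trigrams measure)."""
    grams = lambda s: {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def jaccard_sim(a, b):
    """Token-level Jaccard similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

MEASURES = [trigram_sim, jaccard_sim]

def similarity_matrix(pairs):
    """One row per entity pair, one column per similarity measure."""
    return np.array([[m(a, b) for m in MEASURES] for a, b in pairs])

# Training data: pairs labelled by an expert (1 = match, 0 = non-match).
pairs  = [("customer name", "client name"), ("zip code", "order id")]
labels = [1, 0]

# The learned model encodes the matching strategy, including its thresholds.
strategy = DecisionTreeClassifier(max_depth=2)
strategy.fit(similarity_matrix(pairs), labels)
print(strategy.predict(similarity_matrix([("customer name", "customer")])))
```

Swapping the decision tree for, say, a linear model plays the role of comparing different combination strategies, as the framework's comparative study does.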