6.3 YAM
Yet Another Matcher (YAM) [Duchateau et al. 2009a, b] enables the generation of à la carte schema matchers according to user requirements. It uses a knowledge base that includes a (possibly large) set of similarity measures and machine learning classifiers. All classifiers are trained with scenarios from this knowledge base (and, optionally, scenarios provided by the users). Their individual results (precision, recall, and F-measure) are computed, and according to the adopted strategy, the classifier that achieves the best quality is selected as the schema matcher. The strategy mainly depends on user input. For instance, if the user wants to promote recall, then the classifier with the best recall value is returned. If the user has provided expert mappings, then YAM selects as the schema matcher the classifier that obtains the best F-measure on this set of expert mappings.
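A minimal sketch of this selection step is given below, assuming a generic scikit-learn setup; the candidate classifiers, feature layout, and the function select_matcher are illustrative choices, not YAM's actual implementation or API.

```python
# Hypothetical sketch of YAM-style matcher selection: train several candidate
# classifiers on knowledge-base scenarios, then keep the one that maximises
# the metric chosen by the user's strategy (precision, recall, or F-measure).
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

def select_matcher(X_train, y_train, X_expert, y_expert, strategy="f1"):
    """Return the candidate classifier with the best score on the expert mappings."""
    candidates = [DecisionTreeClassifier(),
                  GaussianNB(),
                  LogisticRegression(max_iter=1000)]
    metric = {"precision": precision_score,
              "recall": recall_score,
              "f1": f1_score}[strategy]
    scored = []
    for clf in candidates:
        clf.fit(X_train, y_train)                        # train on scenarios
        score = metric(y_expert, clf.predict(X_expert))  # evaluate on expert mappings
        scored.append((score, clf))
    return max(scored, key=lambda s: s[0])[1]            # best classifier = matcher
```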
6.4 SMB
In Marie and Gal [2008], the authors propose a machine learning approach, SMB. It uses the Boosting algorithm to classify the similarity measures, which are divided into first-line and second-line matchers. The Boosting algorithm consists of iterating weak classifiers over the training set while re-adjusting the importance of the elements in this training set. Thus, SMB automatically selects a pair of similarity measures as a matcher by focusing on the harder training data. An advantage of this algorithm is the greater weight given to misclassified pairs during training. Although this approach makes use of several similarity measures, it mainly combines a similarity measure (first-line matcher) with a decision maker (second-line matcher). Empirical results show that the selection of the pair does not depend on their individual performance.
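The sketch below illustrates the boosting idea underlying SMB, under the assumption that each column of X holds the scores of one similarity measure for a candidate element pair; the feature layout and the function boost_matchers are hypothetical and do not reproduce SMB's implementation.

```python
# AdaBoost-style loop: one-feature decision stumps act as weak matchers,
# and misclassified pairs receive a larger weight at every round, so later
# rounds concentrate on the harder training data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_matchers(X, y, rounds=10):
    """y must be in {-1, +1}; returns a list of (stump, alpha) weak matchers."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # weak learner on weighted pairs
        pred = stump.predict(X)
        err = np.dot(w, pred != y) / w.sum()
        if err >= 0.5:                       # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)       # misclassified pairs get heavier
        w /= w.sum()
        ensemble.append((stump, alpha))
    return ensemble
```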
6.5 STEM
In a broader way, the STEM framework [Köpcke and Rahm 2008] identifies the most interesting training data set, which is then used to combine matching strategies and to tune several parameters such as thresholds. First, training data is generated, either manually (i.e., an expert labels the entity pairs) or automatically (at random, using static-active selection, or using active learning). Then, similarity values are computed for the pairs in the training data set to build a similarity matrix between each pair and each similarity measure. Finally, the matching strategy is deduced from this matrix by means of a supervised learning algorithm. The output is a tuned matching strategy (i.e., how to combine similarity measures and tune their parameters). The framework enables a comparative study of various similarity measures (e.g., Trigrams, Jaccard) combined with different strategies (e.g., decision tree, linear regression) whose parameters are either manually or automatically tuned.
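The following sketch mirrors this pipeline under simplified assumptions: labelled entity pairs are turned into a similarity matrix with one column per similarity measure, and a supervised learner then derives the combined matching strategy. The measure implementations, the toy training pairs, and the helper similarity_matrix are illustrative, not part of STEM.

```python
# Hypothetical STEM-like pipeline: similarity matrix + supervised combination.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def trigram_sim(a, b):
    """Jaccard overlap of character 3-grams (a simple Trigrams measure)."""
    grams = lambda s: {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def jaccard_sim(a, b):
    """Token-level Jaccard similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

MEASURES = [trigram_sim, jaccard_sim]

def similarity_matrix(pairs):
    """One row per entity pair, one column per similarity measure."""
    return np.array([[m(a, b) for m in MEASURES] for a, b in pairs])

# Training data: pairs labelled by an expert (1 = match, 0 = non-match).
pairs  = [("customer name", "client name"), ("zip code", "order id")]
labels = [1, 0]

# The learned model encodes the matching strategy, including its thresholds.
strategy = DecisionTreeClassifier(max_depth=2)
strategy.fit(similarity_matrix(pairs), labels)
print(strategy.predict(similarity_matrix([("customer name", "customer")])))
```

Swapping the decision tree for, say, a linear model plays the role of comparing different combination strategies, as the framework's comparative study does.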