Table 10.1 Impact of the number of training data on the matching quality with APFEL's decision tree

Dataset   Number of training data   Precision   Recall   F-measure
Russia    20                        83%         48%      60%
Russia    50                        82%         47%      60%
Russia    150                       72%         59%      65%
Biblio    20                        01%         28%      01%
Biblio    50                        46%         25%      32%
Biblio    150                       63%         38%      47%
efficient, since this would not be realistic. For example, if a user needs to match 100 data sources, (s)he can manually find the mappings for a few of them and LSD discovers the mappings for the remaining sources [Doan et al. 2001]. Tuning this parameter is complicated, since it depends on both the availability of training data and the classifier used. To illustrate this, we have partly reproduced a table from Ehrig et al. [2005], shown as Table 10.1. We have limited this excerpt to two matching datasets (Russia and Biblio) and to one of APFEL's classifiers (the decision tree). It shows that the number of training data has a significant impact on the matching quality (in terms of precision, recall and F-measure). For instance, providing 20 training data for the Russia dataset yields the best precision (83%), and this precision tends to decrease with more training data. On the contrary, 20 training data are clearly not sufficient for the Biblio dataset.
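For reference, the F-measure values in Table 10.1 are consistent with the usual harmonic mean of precision and recall. The short check below is only a sketch verifying one row of the table under that assumption.

```python
# F-measure as the harmonic mean of precision and recall (F1).
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Russia dataset, 20 training data: 83% precision, 48% recall.
print(round(f_measure(0.83, 0.48), 2))  # -> 0.61, i.e. ~60% as in Table 10.1 (up to rounding)
```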
Not only may the number of training data be crucial, but also their validity. For instance, APFEL [Ehrig et al. 2005] uses both positive and negative examples for training its classifiers. In this context, it is easier to provide sufficient training data to the system: the authors explain that an initial matcher performs a matching over sample data and lets users rate the discovered correspondences. The rated list of correspondences is then given as input to APFEL. From this list, the tool determines heuristic weights and threshold levels using various machine learning techniques, namely decision trees, neural networks, and support vector machines.
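As an illustration (not APFEL's actual implementation), the rated correspondence list can be turned into a supervised training set: each candidate correspondence becomes a feature vector and the user's rating becomes its label. The sketch below uses scikit-learn's decision tree; the feature names are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per candidate correspondence, e.g. similarity scores
# produced by the initial matcher: [label_sim, instance_sim, structural_sim].
X_train = [
    [0.92, 0.80, 0.75],  # rated as correct by the user (positive example)
    [0.15, 0.20, 0.10],  # rated as incorrect (negative example)
    [0.60, 0.70, 0.65],  # positive example
    [0.40, 0.10, 0.30],  # negative example
]
y_train = [1, 0, 1, 0]   # user ratings: 1 = correspondence holds, 0 = it does not

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Classify a new candidate correspondence produced over the full datasets.
print(clf.predict([[0.85, 0.78, 0.70]]))  # -> [1], predicted as a correspondence
```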
Another work aims at classifying candidate correspondences (as relevant or not) by analysing their features [Naumann et al. 2002]. The features represent Boolean properties over data instances, such as the presence of delimiters. Thus, selecting an appropriate feature set is a first parameter to deal with. The choice of a classifier is also important, and the authors propose, by default, a Naive Bayes classifier for categorical data and a quantile-based classifier for numerical data.
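The sketch below illustrates only the feature-extraction side of such an approach, not the system of Naumann et al.: Boolean properties such as the presence of delimiters are computed on data instances and fed to a Naive Bayes classifier for binary features (scikit-learn's BernoulliNB); instances, labels, and feature choices are illustrative assumptions.

```python
from sklearn.naive_bayes import BernoulliNB

def boolean_features(value: str):
    """Illustrative Boolean properties over a data instance."""
    return [
        "-" in value,                      # contains a dash delimiter
        "@" in value,                      # contains an at sign
        any(c.isdigit() for c in value),   # contains digits
        " " in value,                      # contains whitespace
    ]

# Sample instances labelled with the attribute they belong to (assumed data).
instances = ["555-1234", "603-9911", "alice@example.org", "bob@example.org"]
labels    = ["phone",    "phone",    "email",             "email"]

clf = BernoulliNB()
clf.fit([boolean_features(v) for v in instances], labels)

print(clf.predict([boolean_features("carol@example.org")]))  # -> ['email']
```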
Similarity measures based on machine learning may not always be the most effective. The ASID matcher [Bozovic and Vassalos 2008] considers its Naive Bayes classifier (applied to schema instances) a less credible similarity measure, which is used only after the user has (in)validated the initial results provided by more reliable measures (Jaro and TF/IDF). We think that the credibility of machine learning-based similarity measures heavily depends on the quality of their training data.
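As an illustration of the more reliable measures mentioned above, the sketch below computes a TF/IDF-based cosine similarity over instance values with scikit-learn; for the string measure it substitutes Python's standard-library difflib ratio in place of Jaro, purely to keep the example self-contained (ASID itself relies on Jaro). All data values are made up.

```python
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# String similarity between element labels (difflib ratio stands in for Jaro here).
def label_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(label_similarity("authorName", "author_name"))   # high label similarity

# TF/IDF cosine similarity between the concatenated instance values of two columns.
col_a = ["Database Systems", "Information Retrieval", "Data Integration"]
col_b = ["data integration", "database systems", "semantic web"]

tfidf = TfidfVectorizer().fit_transform([" ".join(col_a), " ".join(col_b)])
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])      # instance-level similarity
```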