Table 10.1 Impact of the number of training data on the matching quality with APFEL's decision tree

Dataset   Number of training data   Precision   Recall   F-measure
Russia    20                        83%         48%      60%
Russia    50                        82%         47%      60%
Russia    150                       72%         59%      65%
Biblio    20                        01%         28%      01%
Biblio    50                        46%         25%      32%
Biblio    150                       63%         38%      47%
efficient, since this would not be realistic. For example, if a user needs to match 100 data sources, (s)he can manually find the mappings for a few of them and LSD discovers the mappings for the remaining sources [Doan et al. 2001]. Tuning this parameter is complicated, since it depends on both the availability of training data and the classifier used. To illustrate this, we have partly reproduced a table from Ehrig et al. [2005], shown as Table 10.1. We have limited this excerpt to two matching datasets (Russia and Biblio) and to one of APFEL's classifiers (the decision tree). It shows that the number of training data has a significant impact on the matching quality (in terms of precision, recall and F-measure). For instance, providing 20 training data for the Russia dataset yields the best precision (83%), and this precision tends to decrease with more training data. On the contrary, 20 training data are clearly not sufficient for the Biblio dataset.
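For reference, the F-measure values in Table 10.1 are consistent with the usual harmonic mean of precision and recall. The short check below is only a sketch verifying one row of the table under that assumption.

```python
# F-measure as the harmonic mean of precision and recall (F1).
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Russia dataset, 20 training data: 83% precision, 48% recall.
print(round(f_measure(0.83, 0.48), 2))  # -> 0.61, i.e. ~60% as in Table 10.1 (up to rounding)
```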
Not only may the number of training data be crucial, but also their validity. For instance, APFEL [Ehrig et al. 2005] uses both positive and negative examples for training its classifiers. In this context, it is easier to provide sufficient training data to the system: the authors explain that an initial matcher performs a matching over sample data and lets users rate the discovered correspondences. The rated list of correspondences is then given as input to APFEL. From this list, the tool determines heuristic weights and threshold levels using various machine learning techniques, namely decision trees, neural networks, and support vector machines.
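As an illustration (not APFEL's actual implementation), the rated correspondence list can be turned into a supervised training set: each candidate correspondence becomes a feature vector and the user's rating becomes its label. The sketch below uses scikit-learn's decision tree; the feature names are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per candidate correspondence, e.g. similarity scores
# produced by the initial matcher: [label_sim, instance_sim, structural_sim].
X_train = [
    [0.92, 0.80, 0.75],  # rated as correct by the user (positive example)
    [0.15, 0.20, 0.10],  # rated as incorrect (negative example)
    [0.60, 0.70, 0.65],  # positive example
    [0.40, 0.10, 0.30],  # negative example
]
y_train = [1, 0, 1, 0]   # user ratings: 1 = correspondence holds, 0 = it does not

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Classify a new candidate correspondence produced over the full datasets.
print(clf.predict([[0.85, 0.78, 0.70]]))  # -> [1], predicted as a correspondence
```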
Another work aims at classifying candidate correspondences (as relevant or not) by analysing their features [Naumann et al. 2002]. The features represent Boolean properties over data instances, such as the presence of delimiters. Thus, selecting an appropriate feature set is a first parameter to deal with. The choice of a classifier is also important, and the authors propose, by default, a Naive Bayes classifier for categorical data and a quantile-based classifier for numerical data.
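The sketch below illustrates only the feature-extraction side of such an approach, not the system of Naumann et al.: Boolean properties such as the presence of delimiters are computed on data instances and fed to a Naive Bayes classifier for binary features (scikit-learn's BernoulliNB); instances, labels, and feature choices are illustrative assumptions.

```python
from sklearn.naive_bayes import BernoulliNB

def boolean_features(value: str):
    """Illustrative Boolean properties over a data instance."""
    return [
        "-" in value,                      # contains a dash delimiter
        "@" in value,                      # contains an at sign
        any(c.isdigit() for c in value),   # contains digits
        " " in value,                      # contains whitespace
    ]

# Sample instances labelled with the attribute they belong to (assumed data).
instances = ["555-1234", "603-9911", "alice@example.org", "bob@example.org"]
labels    = ["phone",    "phone",    "email",             "email"]

clf = BernoulliNB()
clf.fit([boolean_features(v) for v in instances], labels)

print(clf.predict([boolean_features("carol@example.org")]))  # -> ['email']
```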
Similarity measures based on machine learning may not always be the most effective. The ASID matcher [Bozovic and Vassalos 2008] considers its Naive Bayes classifier (applied to schema instances) a less credible similarity measure, which is used only after the user has (in)validated the initial results provided by more reliable measures (Jaro and TF/IDF). We think that the credibility of machine learning-based similarity measures heavily depends on the quality of their training data.
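As an illustration of the more reliable measures mentioned above, the sketch below computes a TF/IDF-based cosine similarity over instance values with scikit-learn; for the string measure it substitutes Python's standard-library difflib ratio in place of Jaro, purely to keep the example self-contained (ASID itself relies on Jaro). All data values are made up.

```python
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# String similarity between element labels (difflib ratio stands in for Jaro here).
def label_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(label_similarity("authorName", "author_name"))   # high label similarity

# TF/IDF cosine similarity between the concatenated instance values of two columns.
col_a = ["Database Systems", "Information Retrieval", "Data Integration"]
col_b = ["data integration", "database systems", "semantic web"]

tfidf = TfidfVectorizer().fit_transform([" ".join(col_a), " ".join(col_b)])
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])      # instance-level similarity
```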