On Evaluating Schema Matching and Mapping - Schema Matching and Mapping


scenarios, i.e., the set of expected matches/mappings. This oracle is typically an expert user. A match/mapping generated by a tool is characterized as correct if it is part of the ground truth, and as incorrect otherwise. A tool successfully implements a scenario when it generates the expected matches/mappings.

Provided with a rich set of mapping scenarios, one can test different aspects of a mapping tool. The effectiveness of the tool is the percentage of these scenarios that the tool could successfully implement. This approach is the one followed by STBenchmark [Alexe et al. 2008b]. The scenarios the benchmark provides have been collected from the related scientific literature and real-world applications.
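The scenario-based notion of effectiveness can be sketched in a few lines of Python. This is only an illustration and not STBenchmark's actual procedure: the function name, the scenario names, and the representation of mappings as plain strings are all hypothetical, and "successful implementation" is assumed here to mean that the produced set equals the expected set exactly.

```python
def effectiveness(expected_by_scenario, produced_by_scenario):
    """Return the percentage of scenarios the tool implemented successfully.

    Both arguments map a scenario name to a collection of mappings.
    Assumption: a scenario counts as successful only when the produced
    mappings are exactly the expected ones.
    """
    if not expected_by_scenario:
        raise ValueError("at least one scenario is required")
    successes = 0
    for name, expected in expected_by_scenario.items():
        produced = produced_by_scenario.get(name, ())
        if set(produced) == set(expected):
            successes += 1
    return 100.0 * successes / len(expected_by_scenario)


# Hypothetical ground truth and tool output for four scenarios.
expected = {
    "copy": {"m1"},
    "join": {"m2", "m3"},
    "nest": {"m4"},
    "split": {"m5"},
}
produced = {
    "copy": {"m1"},          # exact: success
    "join": {"m2"},          # one mapping missing: failure
    "nest": {"m4"},          # exact: success
    "split": {"m5", "m6"},   # one extra mapping: failure
}
score = effectiveness(expected, produced)
```

Under the exact-match assumption, the "join" and "split" scenarios count as failures even though the tool came close in both, which is the kind of penalty the all-or-nothing model imposes.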

Characterizing the effectiveness of a tool by the successful or unsuccessful implementation of scenarios may not be the optimal approach, especially in the case of mapping systems. Very often, a mapping tool may not be able to produce exactly the expected mappings, yet it may be able to generate a good approximation of them, or mappings that produce a target instance very close to the expected one. Under the above model, such a tool will be unfairly penalized as unsuccessful, even though the final result is very close to the one expected. For this reason, metrics that measure the proximity of the produced results to the expected ones are becoming an increasingly popular alternative.

7.2 Quality of the Generated Matchings/Mappings

Four metrics that have been used extensively in the area of matching tool evaluation are precision, recall, f-measure, and fallout [Euzenat and Shvaiko 2007]. They are all intended to quantify the proximity of the results generated by a matching tool to those expected. They are based on the notions of true positives, false positives, true negatives, and false negatives. Given two schemas S and T, let M represent the set of all possible matches that can exist between their respective elements. Assume that an oracle provides the list of expected matches. These matches are referred to as relevant, and all the other matches in M as irrelevant. The matching tool provides a list of matches that it considers true. These are the tool relevant matches, while the remaining matches in M are the tool irrelevant matches. A match in M is characterized as true positive, false positive, true negative, or false negative, depending on which of the above sets it belongs to. The respective definitions are illustrated in Table 9.1.
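The four-way classification can be illustrated with plain set operations. The sketch below assumes, for simplicity, that M is the cross product of the elements of S and T (one-to-one matches only); the schema element names and function names are hypothetical.

```python
from itertools import product


def all_possible_matches(source_elements, target_elements):
    """Build M under the assumption of one-to-one element matches."""
    return set(product(source_elements, target_elements))


def contingency(all_matches, relevant, tool_relevant):
    """Split M into the TP, FP, FN, and TN sets."""
    relevant = set(relevant)
    tool_relevant = set(tool_relevant)
    if not (relevant <= all_matches and tool_relevant <= all_matches):
        raise ValueError("every match must belong to M")
    return {
        "TP": relevant & tool_relevant,                # expected and returned
        "FP": tool_relevant - relevant,                # returned, not expected
        "FN": relevant - tool_relevant,                # expected, not returned
        "TN": all_matches - relevant - tool_relevant,  # neither
    }


# Hypothetical schemas S and T.
S = ["name", "addr"]
T = ["fullname", "address", "phone"]
M = all_possible_matches(S, T)

oracle = {("name", "fullname"), ("addr", "address")}  # relevant matches
tool = {("name", "fullname"), ("addr", "phone")}      # tool relevant matches

cells = contingency(M, oracle, tool)
```

The four sets are disjoint and together cover M, so their sizes always sum to the number of possible matches.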

The precision, recall, and f-measure [Van-Risbergen 1979] are well known from the information retrieval domain. They return a real value between 0 and 1 and have been used in many matching evaluation efforts [Duchateau et al. 2007; Do et al. 2002]. Figure 9.8 depicts a matching example. It illustrates two schemas related


Table 9.1  Contingency table forming the base of evaluation measures

                          Relevant matches       Irrelevant matches
Tool relevant matches     TP (true positive)     FP (false positive)
Tool irrelevant matches   FN (false negative)    TN (true negative)
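Given the counts in the four cells of the contingency table, the four metrics can be computed as sketched below. The formulas are the standard information retrieval definitions, which this excerpt does not spell out; the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not one taken from the text.

```python
def precision(tp, fp):
    """Fraction of the matches returned by the tool that are relevant."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of the relevant matches that the tool returned."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision p and recall r (balanced form)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def fallout(fp, tn):
    """Fraction of the irrelevant matches that the tool returned."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical counts: 1 true positive, 1 false positive,
# 1 false negative, and 3 true negatives.
tp, fp, fn, tn = 1, 1, 1, 3
p = precision(tp, fp)
r = recall(tp, fn)
f = f_measure(p, r)
fo = fallout(fp, tn)
```

For precision, recall, and f-measure, higher values indicate results closer to the expected ones; for fallout, lower values are better, since it measures how many irrelevant matches the tool returned.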
