have instead opted for a comparison of the results of the mappings, e.g., the target
instances.
Nevertheless, precision, recall, and the f-measure can be used to evaluate the
large class of tools that do not differentiate between the matching and the mapping
processes but consider the whole task as a monolithic procedure. Spicy [ Bonifati et al.
2008a ] is an example of such a tool: it pipelines a matching module and a mapping
generation module and allows the mapping designer to iterate between the
two processes to improve the quality of the generated mappings. In Spicy, the mapping
tasks were designed in such a way that the source always contains a mapping
that covers the entire target, meaning that no subset of the target schema remains
unmapped. The set of mapping scenarios in the system is built in such a way that,
for a given target schema, the correct set of matches that will generate a given
predetermined mapping is internally identified. These matches are called the ideal
match M_id. At this point, the mapping generation algorithm can be run, and a single
transformation T_best, i.e., the mapping that has the best score in terms of instance
similarity (cf. next section for details), is generated. Then, the matches M_T_best
on which this mapping is based are identified. In the ideal case, these matches
are the same as the ideal match M_id. The quality of the tool can thus be measured
in terms of the precision and recall of M_T_best with respect to M_id. However,
Spicy reports quality only in terms of precision. The reason is that, in all cases,
the tool returns a number of matches equal to the size of the target, as mentioned
above. As a consequence, precision and recall are both equal to the number of
correct matches in M_T_best divided by the size of the target, which means that
either precision or recall suffices to characterize the quality of the generated
mappings.
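The argument above can be checked with a small sketch. The match sets and attribute names below are made up for illustration; the point is that when the returned match set M_T_best and the ideal match M_id both have exactly one match per target element, their sizes coincide and precision equals recall.

```python
def precision_recall(returned, ideal):
    """Compute precision and recall of a returned match set vs. the ideal one."""
    correct = len(returned & ideal)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(ideal) if ideal else 0.0
    return precision, recall

# Toy match sets: pairs (source attribute, target attribute), one per
# target element, so |M_T_best| = |M_id| = size of the target.
M_id = {("s.name", "t.name"), ("s.addr", "t.address"), ("s.tel", "t.phone")}
M_T_best = {("s.name", "t.name"), ("s.addr", "t.address"), ("s.fax", "t.phone")}

p, r = precision_recall(M_T_best, M_id)
print(p, r)  # both 2/3: two correct matches out of three target elements
```

Because the denominators |M_T_best| and |M_id| are both equal to the target size, the two metrics always coincide in this setting, which is why reporting precision alone is sufficient.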
The cases in which the source does not contain a mapping that covers the entire
target are more complex and have so far not been addressed. It is believed that the
most general case, in which the target schema is not entirely covered by the mapping,
entails a new class of mapping tasks in which the target instance is partially filled
with data exchanged from the source and partially filled with its own data.
The problem of characterizing mappings in a quantitative way has also been studied
[ Fagin et al. 2009b ] through the notion of information loss , which is introduced
to measure how much a schema mapping deviates from an ideal invertible mapping.
An invertible mapping is one that, given the generated target instance, can be
used to regenerate the original source instance. A first definition of invertibility
considered only constants in the source instance and constants alongside labeled
nulls in the target (cf. [ Fagin et al. 2011 ]). Labeled nulls are generated values for
elements in the target that require a value but for which the mapping provides no
specification. In the inversion, these labeled nulls can propagate into the source
instance, resulting in an instance that has less information than the original one. To
capture such information loss in a precise way, the notion of maximum extended
recovery has been introduced for tgds with disjunction and inequalities [ Fagin et al.
2009b ]. This new metric clearly identifies a viable approach to precisely comparing
schema mappings, but the full potential of this metric in benchmarking mapping
tools still remains to be explored.
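The information loss caused by labeled nulls can be illustrated with a minimal sketch. The schemas, tuples, and the N1, N2, ... naming scheme below are made up for illustration: a mapping that copies only part of the source (here, dropping the phone and introducing an unspecified department) cannot be inverted without loss, because the inversion fills the dropped positions with labeled nulls instead of the original constants.

```python
import itertools

_null_counter = itertools.count(1)

def fresh_null():
    """Generate a labeled null N1, N2, ... for an unspecified value."""
    return f"N{next(_null_counter)}"

# Source tuples: Emp(name, phone).
source = [("Alice", "555-0100"), ("Bob", "555-0199")]

# Forward mapping (a tgd of the form Emp(n, p) -> exists d: Dept(n, d)):
# the phone is dropped, and the unspecified department becomes a labeled null.
target = [(name, fresh_null()) for name, _ in source]
print(target)  # [('Alice', 'N1'), ('Bob', 'N2')]

# Attempted inversion back to Emp(name, phone): the phone cannot be
# recovered, so the reconstructed source carries labeled nulls where the
# original instance had constants -- strictly less information.
recovered = [(name, fresh_null()) for name, _ in target]
print(recovered)  # [('Alice', 'N3'), ('Bob', 'N4')]
```

Comparing `recovered` with `source` makes the deviation from an ideal invertible mapping concrete: an information-loss measure such as the one sketched in the text would quantify how many constants were replaced by nulls.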