have instead opted for a comparison of the results of the mappings, e.g., the target
instances.
Nevertheless, precision, recall, and the f-measure can be used to evaluate the
large class of tools that do not differentiate between the matching and the mapping
processes but consider the whole task as a monolithic procedure. Spicy [ Bonifati et al.
2008a ] is an example of such a tool: it pipelines a matching module and a mapping
generation module and allows the mapping designer to iterate between the
two processes to improve the quality of the generated mappings. In Spicy, the mapping
tasks were designed in such a way that the source always contains a mapping
that covers the entire target, meaning that no subset of the target schema remains
unmapped. The set of mapping scenarios in the system is built in such a way that,
for a given target schema, the correct set of matches that will generate a given
predetermined mapping is internally identified. These matches are called the ideal
match M_id. At this point, the mapping generation algorithm can be run, and a single
transformation T_best, i.e., the mapping that has the best score in terms of instance
similarity (cf. next section for details), is generated. Then, the matches M_T_best
on which this mapping is based are identified. In the ideal case, these matches
are the same as the ideal match M_id. The quality of the tool can thus be measured
in terms of the precision and recall of M_T_best with respect to M_id. However,
Spicy reports quality only in terms of precision. The reason is that, in all cases,
the tool returns a number of matches equal to the size of the target, as mentioned
above. As a consequence, precision and recall are both equal to the number of
correct matches in M_T_best divided by the size of the target, which means that
either precision or recall suffices to characterize the quality of the generated
mappings.
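The argument above can be checked with a small sketch. The match sets and attribute names below are made up for illustration; the point is that when the returned match set M_T_best and the ideal match M_id both have exactly one match per target element, their sizes coincide and precision equals recall.

```python
def precision_recall(returned, ideal):
    """Compute precision and recall of a returned match set vs. the ideal one."""
    correct = len(returned & ideal)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(ideal) if ideal else 0.0
    return precision, recall

# Toy match sets: pairs (source attribute, target attribute), one per
# target element, so |M_T_best| = |M_id| = size of the target.
M_id = {("s.name", "t.name"), ("s.addr", "t.address"), ("s.tel", "t.phone")}
M_T_best = {("s.name", "t.name"), ("s.addr", "t.address"), ("s.fax", "t.phone")}

p, r = precision_recall(M_T_best, M_id)
print(p, r)  # both 2/3: two correct matches out of three target elements
```

Because the denominators |M_T_best| and |M_id| are both equal to the target size, the two metrics always coincide in this setting, which is why reporting precision alone is sufficient.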
The cases in which the source does not contain a mapping that covers the entire
target are more complex and have so far not been addressed. It is believed that the
most general case, in which the target schema is not entirely covered by the mapping,
entails a new class of mapping tasks in which the target instance is partially filled
with data exchanged from the source and partially filled with its own data.
The problem of characterizing mappings in a quantitative way has also been studied
[ Fagin et al. 2009b ] through the notion of information loss , which is introduced
to measure how much a schema mapping deviates from an ideal invertible mapping.
An invertible mapping is one that, given the generated target instance, can be
used to regenerate the original source instance. A first definition of invertibility
considered only constants in the source instance and constants alongside labeled
nulls in the target (cf. [ Fagin et al. 2011 ]). Labeled nulls are generated values for
elements in the target that require a value but for which the mapping provides no
specification. In the inversion, these labeled nulls can propagate into the source
instance, resulting in an instance that has less information than the original one. To
capture such information loss in a precise way, the notion of maximum extended
recovery has been introduced for tgds with disjunction and inequalities [ Fagin et al.
2009b ]. This new metric clearly identifies a viable approach to precisely comparing
schema mappings, but the full potential of this metric in benchmarking mapping
tools still remains to be explored.
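The information loss caused by labeled nulls can be illustrated with a minimal sketch. The schemas, tuples, and the N1, N2, ... naming scheme below are made up for illustration: a mapping that copies only part of the source (here, dropping the phone and introducing an unspecified department) cannot be inverted without loss, because the inversion fills the dropped positions with labeled nulls instead of the original constants.

```python
import itertools

_null_counter = itertools.count(1)

def fresh_null():
    """Generate a labeled null N1, N2, ... for an unspecified value."""
    return f"N{next(_null_counter)}"

# Source tuples: Emp(name, phone).
source = [("Alice", "555-0100"), ("Bob", "555-0199")]

# Forward mapping (a tgd of the form Emp(n, p) -> exists d: Dept(n, d)):
# the phone is dropped, and the unspecified department becomes a labeled null.
target = [(name, fresh_null()) for name, _ in source]
print(target)  # [('Alice', 'N1'), ('Bob', 'N2')]

# Attempted inversion back to Emp(name, phone): the phone cannot be
# recovered, so the reconstructed source carries labeled nulls where the
# original instance had constants -- strictly less information.
recovered = [(name, fresh_null()) for name, _ in target]
print(recovered)  # [('Alice', 'N3'), ('Bob', 'N4')]
```

Comparing `recovered` with `source` makes the deviation from an ideal invertible mapping concrete: an information-loss measure such as the one sketched in the text would quantify how many constants were replaced by nulls.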