the core. The core [Fagin et al. 2003] is a minimum universal solution [Fagin et al. 2005]. Core identification has been shown to be coNP-hard [Fagin et al. 2005] for certain classes of mapping dependencies. Despite these complexity results, efficient techniques have been developed that, given two schemas and a set of mapping dependencies between them in the form of tuple-generating dependencies, produce transformation scripts, e.g., in XSLT or SQL, whose execution efficiently generates a core target instance [Mecca et al. 2009; ten Cate et al. 2009].
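To make this concrete, the following minimal sketch (in Python, with hypothetical relations Emp and Mgr) chases a single tuple-generating dependency, Emp(name, dept) -> EXISTS m. Mgr(dept, m), over a toy source instance and then shrinks the canonical result to its core; it only illustrates why the core can be smaller than the canonical universal solution and is not a general core-computation algorithm.

from itertools import count

# Toy source instance of a hypothetical relation Emp(name, dept).
emp = [("Alice", "Sales"), ("Bob", "Sales"), ("Carol", "HR")]

fresh = count(1)

# Naive chase of the tgd Emp(n, d) -> EXISTS m. Mgr(d, m):
# one labeled null per source tuple (the canonical universal solution).
canonical = [(dept, f"N{next(fresh)}") for _name, dept in emp]
# canonical == [("Sales", "N1"), ("Sales", "N2"), ("HR", "N3")]

# For this single tgd, tuples that agree on all non-null positions are
# redundant, so the core keeps one witness per department.
core = {}
for dept, null in canonical:
    core.setdefault(dept, (dept, null))
core = list(core.values())
# core == [("Sales", "N1"), ("HR", "N3")]

The techniques cited above generate SQL or XSLT scripts whose execution produces such a core directly, rather than materializing the larger canonical solution and reducing it afterwards.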
Time performance is becoming particularly critical in ETL tools, which typically deal with large volumes of data. Recent ETL benchmarks [Simitsis et al. 2009] consider it one of the major factors in every ETL tool evaluation. Other factors mentioned in ETL benchmarks are the workflow execution throughput, the average latency per tuple, and the workflow execution throughput under failures. The notion of time performance in ETL tools extends beyond the construction of the ETL workflow: apart from the data translation time, it also covers the time required to answer business-level queries on the transformed data.
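As an illustration of the first two quantities, the following sketch computes throughput and average per-tuple latency from hypothetical per-tuple timestamps; the numbers and variable names are invented for the example and do not come from the benchmark.

# Hypothetical timestamps (in seconds) for four tuples flowing through a workflow.
arrivals   = [0.00, 0.01, 0.02, 0.03]   # tuple enters the workflow
departures = [0.50, 0.52, 0.55, 0.60]   # tuple leaves the workflow

elapsed = max(departures) - min(arrivals)
throughput = len(arrivals) / elapsed                 # tuples per second
avg_latency = sum(d - a for a, d in zip(arrivals, departures)) / len(arrivals)

print(f"throughput = {throughput:.1f} tuples/s, average latency = {avg_latency:.2f} s/tuple")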
[Parallelization] One way to improve the data transformation time is to increase parallelization by generating mappings with minimum interdependencies. There are, in general, two broad categories of parallel processing: pipelining and partitioning. In pipelining, different parts of the transformation are executed in parallel on a system with more than one processor, and the data generated by one component are consumed immediately by the next component, without waiting for the first component to fully complete its task. Pipelining works well for transformations that do not involve extremely large amounts of data. Otherwise, a different parallelization mechanism, called partitioning, is preferable. In partitioning, the data is first divided into parts, and then the transformation described by the mappings is applied to each partition independently of the others [Simitsis et al. 2009].
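The contrast between the two styles can be sketched as follows in Python; the transformation, the record layout, and the two-way split are hypothetical stand-ins for whatever the generated mappings prescribe.

from multiprocessing import Pool

def transform(record):
    # Stand-in for the transformation derived from the mappings.
    return {**record, "name": record["name"].upper()}

# Pipelining: the stages are chained generators, so a tuple produced by
# one stage is consumed by the next without waiting for the first stage
# to finish its whole input.
def extract(rows):
    for r in rows:
        yield r

def transform_stage(rows):
    for r in rows:
        yield transform(r)

def load(rows):
    return list(rows)   # stand-in for writing to the target

def transform_partition(partition):
    # Partitioning: the same transformation applied to one chunk of the
    # data, independently of the other chunks.
    return [transform(r) for r in partition]

if __name__ == "__main__":
    rows = [{"name": "alice"}, {"name": "bob"}, {"name": "carol"}, {"name": "dave"}]
    pipelined = load(transform_stage(extract(rows)))

    partitions = [rows[0::2], rows[1::2]]   # a hash or range split in practice
    with Pool(processes=2) as pool:
        chunks = pool.map(transform_partition, partitions)
    partitioned = [r for chunk in chunks for r in chunk]

    print(pipelined)
    print(partitioned)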
6.3 Human Effort
Since the goal of a matching or mapping tool is to relieve the designer of the laborious task of match and mapping specification, it is natural to consider the effort required from the designer as one of the evaluation metrics for such a tool. In a schema matching task, the input consists of only the two schemas. Since the task involves semantics, the designer must go through all the produced matches and verify their correctness. Consequently, the effort the designer needs to spend during a matching task can be naively quantified by the number of matches produced by the matcher and by their complexity.
A matcher may produce not only false positives but also false negatives, i.e., correct matches that are missing from its result; the designer will have to add these manually or tune the tool to generate them. Two metrics have been proposed in the literature for quantifying this effort. One is the overall, which is also found under