the core. The core [Fagin et al. 2003] is a minimum universal solution [Fagin et al. 2005]. Core identification has been shown to be coNP-hard [Fagin et al. 2005] for certain classes of mapping dependencies. Despite these complexity results, efficient techniques have been developed that, given two schemas and a set of mapping dependencies between them in the form of tuple-generating dependencies, produce transformation scripts, e.g., in XSLT or SQL, whose execution efficiently generates a core target instance [Mecca et al. 2009; ten Cate et al. 2009].
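To make this concrete, the following minimal sketch (in Python, with hypothetical relations Emp and Mgr) chases a single tuple-generating dependency, Emp(name, dept) -> EXISTS m. Mgr(dept, m), over a toy source instance and then shrinks the canonical result to its core; it only illustrates why the core can be smaller than the canonical universal solution and is not a general core-computation algorithm.

from itertools import count

# Toy source instance of a hypothetical relation Emp(name, dept).
emp = [("Alice", "Sales"), ("Bob", "Sales"), ("Carol", "HR")]

fresh = count(1)

# Naive chase of the tgd Emp(n, d) -> EXISTS m. Mgr(d, m):
# one labeled null per source tuple (the canonical universal solution).
canonical = [(dept, f"N{next(fresh)}") for _name, dept in emp]
# canonical == [("Sales", "N1"), ("Sales", "N2"), ("HR", "N3")]

# For this single tgd, tuples that agree on all non-null positions are
# redundant, so the core keeps one witness per department.
core = {}
for dept, null in canonical:
    core.setdefault(dept, (dept, null))
core = list(core.values())
# core == [("Sales", "N1"), ("HR", "N3")]

The techniques cited above generate SQL or XSLT scripts whose execution produces such a core directly, rather than materializing the larger canonical solution and reducing it afterwards.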
Time performance is becoming particularly critical in ETL tools, which typically deal with large volumes of data. Recent ETL benchmarks [Simitsis et al. 2009] consider it one of the major factors in every ETL tool evaluation. Other factors mentioned in ETL benchmarks are the workflow execution throughput, the average latency per tuple, and the workflow execution throughput under failures. The notion of time performance in ETL tools extends beyond the construction of the ETL workflow: apart from the data translation time, it also covers the time required to answer business-level queries on the transformed data.
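As an illustration of the first two quantities, the following sketch computes throughput and average per-tuple latency from hypothetical per-tuple timestamps; the numbers and variable names are invented for the example and do not come from the benchmark.

# Hypothetical timestamps (in seconds) for four tuples flowing through a workflow.
arrivals   = [0.00, 0.01, 0.02, 0.03]   # tuple enters the workflow
departures = [0.50, 0.52, 0.55, 0.60]   # tuple leaves the workflow

elapsed = max(departures) - min(arrivals)
throughput = len(arrivals) / elapsed                 # tuples per second
avg_latency = sum(d - a for a, d in zip(arrivals, departures)) / len(arrivals)

print(f"throughput = {throughput:.1f} tuples/s, average latency = {avg_latency:.2f} s/tuple")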
[Parallelization] One way to improve the data transformation time is to increase parallelization by generating mappings with minimum interdependencies. There are, in general, two broad categories of parallel processing: pipelining and partitioning. In pipelining, different parts of the transformation are executed in parallel on a system with more than one processor, and the data generated by one component are consumed immediately by the next component, without waiting for the first component to fully complete its task. Pipelining works well for transformations that do not involve extremely large amounts of data. Otherwise, a different parallelization mechanism, called partitioning, is preferable. In partitioning, the data is first divided into parts, and then the transformation described by the mappings is applied to each partition independently of the others [Simitsis et al. 2009].
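The contrast between the two styles can be sketched as follows in Python; the transformation, the record layout, and the two-way split are hypothetical stand-ins for whatever the generated mappings prescribe.

from multiprocessing import Pool

def transform(record):
    # Stand-in for the transformation derived from the mappings.
    return {**record, "name": record["name"].upper()}

# Pipelining: the stages are chained generators, so a tuple produced by
# one stage is consumed by the next without waiting for the first stage
# to finish its whole input.
def extract(rows):
    for r in rows:
        yield r

def transform_stage(rows):
    for r in rows:
        yield transform(r)

def load(rows):
    return list(rows)   # stand-in for writing to the target

def transform_partition(partition):
    # Partitioning: the same transformation applied to one chunk of the
    # data, independently of the other chunks.
    return [transform(r) for r in partition]

if __name__ == "__main__":
    rows = [{"name": "alice"}, {"name": "bob"}, {"name": "carol"}, {"name": "dave"}]
    pipelined = load(transform_stage(extract(rows)))

    partitions = [rows[0::2], rows[1::2]]   # a hash or range split in practice
    with Pool(processes=2) as pool:
        chunks = pool.map(transform_partition, partitions)
    partitioned = [r for chunk in chunks for r in chunk]

    print(pipelined)
    print(partitioned)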
6.3 Human Effort
Since the goal of a matching or mapping tool is to relieve the designer of the laborious task of match and mapping specification, it is natural to consider the effort required from the designer as one of the evaluation metrics for such a tool. In a schema matching task, the input consists of only the two schemas. Since the task involves semantics, the designer must go through all the produced matches and verify their correctness. Consequently, the effort the designer needs to spend during a matching task can be naively quantified by the number of matches produced by the matcher and by their complexity.
A matcher may produce not only false positives but also false negatives, i.e., correct matches that are missing from its result; the designer will have to add these manually or tune the tool to generate them. Two metrics have been proposed in the literature for quantifying this effort. One is the overall, which is also found under