new matching evaluation scenario. The selection of the portions was done in a way
that preserved five main properties: (1) the complexity of the matching operators,
(2) the incrementality, i.e., the ability to reveal weaknesses of the matching tool
under evaluation, (3) the ability to distinguish among the different matching solu-
tions, (4) the quality preservation, meaning that any matching quality measure
calculated on the subset of the schemas did not differ substantially from the measure
calculated on the whole dataset, and (5) the correctness, meaning that any matches
considered were correct.
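The quality-preservation property (4) can be made concrete as a check that a matching quality measure, such as the F-measure, computed on the selected subset stays close to its value on the whole dataset. The following is a minimal sketch; the function names and the 0.05 tolerance are illustrative assumptions, not part of the original evaluation scenario.

```python
# Sketch: checking the "quality preservation" property when selecting a
# subset of a matching dataset. A match is a (source, target) element pair.

def f_measure(found, correct):
    """Harmonic mean of precision and recall over sets of matches."""
    found, correct = set(found), set(correct)
    if not found or not correct:
        return 0.0
    tp = len(found & correct)
    precision = tp / len(found)
    recall = tp / len(correct)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def preserves_quality(found, correct, subset_elements, tolerance=0.05):
    """True if the F-measure restricted to the schema subset stays within
    `tolerance` of the F-measure on the whole dataset."""
    restrict = lambda matches: {(s, t) for (s, t) in matches
                                if s in subset_elements and t in subset_elements}
    full = f_measure(found, correct)
    sub = f_measure(restrict(found), restrict(correct))
    return abs(full - sub) <= tolerance
```

A subset that happens to contain only the correctly matched elements would, for instance, inflate the F-measure to 1.0 and fail this check even though every retained match is correct.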
A top-down approach has also been proposed for data exchange systems
[ Okawara et al. 2006 ] and is the model upon which the THALIA [ Hammer et al.
2005 ] integration benchmark is based. In particular, THALIA provides a large dataset
and filters that can select portions of this dataset in terms of both values and schemas.
eTuner [ Lee et al. 2007 ] is a tool for automatically tuning matchers that utilizes
the instance data in conjunction with the schema information and can also be used
to create synthetic scenarios in a top-down fashion. It starts with an initial schema
and splits it into two schemas, each keeping the same structure but holding half of the instance data.
The correct matches between the schemas generated by the split are known, and the
idea is to apply transformations to one of the two schemas to create a new schema.
The transformations are based on rules at three levels: (1) modifications on the struc-
ture of the schema, (2) changes of the schema element names, and (3) perturbations
of the data. The matches between schema elements are traced through the whole
process so that they are known at the end and are used for evaluating the matchers.
A limitation of eTuner is that the user needs to create or find a reference ontology.
Furthermore, the set of modifications that can be performed on the data is limited,
making the perturbed data look less natural than real-world data.
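The split-and-perturb idea behind eTuner can be sketched as follows. One initial table is split into two structurally identical halves of the instance data, and the element names of one half are then perturbed while the correct matches are recorded. The concrete perturbation rule below (prefix abbreviation) is an illustrative assumption; eTuner applies rules at the structure, name, and data levels.

```python
# Sketch of eTuner's split-and-perturb approach. A schema is represented
# simply as a list of column names plus a list of instance rows.
import random

def split_and_perturb(columns, rows, seed=0):
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)                      # distribute instances between halves
    half = len(rows) // 2
    left_rows, right_rows = rows[:half], rows[half:]

    def perturb_name(name):
        # Level (2) rule: change a schema element name, here by abbreviation.
        return name[:3] + "_" if len(name) > 3 else name

    right_columns = [perturb_name(c) for c in columns]
    # The correct matches are known by construction: column i maps to column i.
    gold = list(zip(columns, right_columns))
    return (columns, left_rows), (right_columns, right_rows), gold
```

Because the gold matches are produced as a by-product of the transformation itself, no manual annotation is needed to evaluate a matcher on the generated pair of schemas.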
In the bottom-up approach of synthetic scenario generation, some small scenario
is used as a seed for the construction of more complex scenarios. STBench-
mark [ Alexe et al. 2008b ] is based on this idea to provide synthetic mapping test
scenarios, i.e., a synthetic source schema, a target schema, an expected mapping
between the source and the target schema, and an instance of the source schema.
The seeds it uses are its basic scenarios that were mentioned in the previous sec-
tion. Given a basic scenario, STBenchmark constructs an expanded version of it.
The expanded version is an image of the original scenario but on a larger scale.
The scale is determined by dimensions specified through configuration parameters
representing characteristics of the schemas and the mappings. For instance, in a
copy basic scenario, the configuration parameters are the average nesting depth of
the schemas and the average number of attributes of each element. In the vertical
partition scenario (cf. Fig. 9.6 ), on the other hand, the configuration parameters
additionally include the length of the join paths, the type of the joins, and the number of
attributes involved in each such join. Expanded scenarios can then be concatenated
to produce even larger mapping scenarios. Figure 9.7a illustrates an expanded unnest
basic mapping scenario, and Fig. 9.7b illustrates how a large synthetic scenario is
created by concatenating smaller scenarios. STBenchmark⁵ has also the ability to

⁵ www.stbenchmark.org.
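The expansion of a copy basic scenario by its configuration parameters can be sketched as follows. The two parameters (nesting depth, attributes per element) follow the text; the representation of schemas as nested dictionaries and the path-based mapping are illustrative assumptions, not STBenchmark's actual data structures.

```python
# Sketch: expanding a "copy" basic scenario to a chosen scale, in the
# spirit of STBenchmark's configuration-parameter-driven generation.

def expand_copy_scenario(nesting_depth, attrs_per_element):
    """Build a synthetic nested source schema; in a copy scenario the
    target is an identical schema and the expected mapping is the identity."""
    def build(depth):
        element = {f"attr{i}": "string" for i in range(attrs_per_element)}
        if depth > 1:
            element["child"] = build(depth - 1)
        return element

    def paths(element, prefix=""):
        for name, value in element.items():
            path = f"{prefix}/{name}"
            if isinstance(value, dict):
                yield from paths(value, path)
            else:
                yield path

    source = build(nesting_depth)
    target = build(nesting_depth)               # copy: target mirrors source
    mapping = [(p, p) for p in paths(source)]   # identity correspondences
    return source, target, mapping
```

Concatenating several such expanded scenarios, each with its own parameter settings, then yields arbitrarily large synthetic mapping scenarios whose expected mappings remain known.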