new matching evaluation scenario. The selection of the portions was done in a way
that preserved five main properties: (1) the complexity of the matching operators,
(2) the incrementality, i.e., the ability to reveal weaknesses of the matching tool
under evaluation, (3) the ability to distinguish among the different matching solu-
tions, (4) the quality preservation, meaning that any matching quality measure
calculated on the subset of the schemas did not differ substantially from the measure
calculated on the whole dataset, and (5) the correctness, meaning that any matches
considered were correct.
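The quality-preservation property (4) can be made concrete as a check that a matching quality measure, such as the F-measure, computed on the selected subset stays close to its value on the whole dataset. The following is a minimal sketch; the function names and the 0.05 tolerance are illustrative assumptions, not part of the original evaluation scenario.

```python
# Sketch: checking the "quality preservation" property when selecting a
# subset of a matching dataset. A match is a (source, target) element pair.

def f_measure(found, correct):
    """Harmonic mean of precision and recall over sets of matches."""
    found, correct = set(found), set(correct)
    if not found or not correct:
        return 0.0
    tp = len(found & correct)
    precision = tp / len(found)
    recall = tp / len(correct)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def preserves_quality(found, correct, subset_elements, tolerance=0.05):
    """True if the F-measure restricted to the schema subset stays within
    `tolerance` of the F-measure on the whole dataset."""
    restrict = lambda matches: {(s, t) for (s, t) in matches
                                if s in subset_elements and t in subset_elements}
    full = f_measure(found, correct)
    sub = f_measure(restrict(found), restrict(correct))
    return abs(full - sub) <= tolerance
```

A subset that happens to contain only the correctly matched elements would, for instance, inflate the F-measure to 1.0 and fail this check even though every retained match is correct.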
A top-down approach has also been proposed for data exchange systems
[ Okawara et al. 2006 ] and is the model upon which the THALIA [ Hammer et al.
2005 ] integration benchmark is based. In particular, THALIA provides a large dataset
and filters that can select portions of this dataset in terms of both values and schemas.
eTuner [ Lee et al. 2007 ] is a tool for automatically tuning matchers that utilizes
the instance data in conjunction with the schema information and can also be used
to create synthetic scenarios in a top-down fashion. It starts with an initial schema
and splits it into two schemas, each keeping the same structure but holding half of the instance data.
The correct matches between the schemas generated by the split are known, and the
idea is to apply transformations to one of the two schemas to create a new schema.
The transformations are based on rules at three levels: (1) modifications on the struc-
ture of the schema, (2) changes of the schema element names, and (3) perturbations
of the data. The matches between schema elements are traced through the whole
process so that they are known at the end and are used for evaluating the matchers.
A limitation of eTuner is that the user needs to create or find a reference ontology.
Furthermore, the set of modifications that can be performed on the data is limited,
making the perturbed data look less natural than real-world data.
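The split-and-perturb idea behind eTuner can be sketched as follows. One initial table is split into two structurally identical halves of the instance data, and the element names of one half are then perturbed while the correct matches are recorded. The concrete perturbation rule below (prefix abbreviation) is an illustrative assumption; eTuner applies rules at the structure, name, and data levels.

```python
# Sketch of eTuner's split-and-perturb approach. A schema is represented
# simply as a list of column names plus a list of instance rows.
import random

def split_and_perturb(columns, rows, seed=0):
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)                      # distribute instances between halves
    half = len(rows) // 2
    left_rows, right_rows = rows[:half], rows[half:]

    def perturb_name(name):
        # Level (2) rule: change a schema element name, here by abbreviation.
        return name[:3] + "_" if len(name) > 3 else name

    right_columns = [perturb_name(c) for c in columns]
    # The correct matches are known by construction: column i maps to column i.
    gold = list(zip(columns, right_columns))
    return (columns, left_rows), (right_columns, right_rows), gold
```

Because the gold matches are produced as a by-product of the transformation itself, no manual annotation is needed to evaluate a matcher on the generated pair of schemas.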
In the bottom-up approach of synthetic scenario generation, some small scenario
is used as a seed for the construction of more complex scenarios. STBench-
mark [ Alexe et al. 2008b ] is based on this idea to provide synthetic mapping test
scenarios, i.e., a synthetic source schema, a target schema, an expected mapping
between the source and the target schema, and an instance of the source schema.
The seeds it uses are its basic scenarios that were mentioned in the previous sec-
tion. Given a basic scenario, STBenchmark constructs an expanded version of it.
The expanded version is an image of the original scenario but on a larger scale.
The scale is determined by dimensions specified through configuration parameters
representing characteristics of the schemas and the mappings. For instance, in a
copy basic scenario, the configuration parameters are the average nesting depth of
the schemas and the average number of attributes of each element. In the vertical
partition scenario (cf. Fig. 9.6 ), on the other hand, the configuration parameters
additionally include the length of the join paths, the type of the joins, and the number of
attributes involved in each such join. Expanded scenarios can then be concatenated
to produce even larger mapping scenarios. Figure 9.7a illustrates an expanded unnest
basic mapping scenario, and Fig. 9.7b illustrates how a large synthetic scenario is
created by concatenating smaller scenarios. STBenchmark⁵ has also the ability to

⁵ www.stbenchmark.org.
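The expansion of a copy basic scenario by its configuration parameters can be sketched as follows. The two parameters (nesting depth, attributes per element) follow the text; the representation of schemas as nested dictionaries and the path-based mapping are illustrative assumptions, not STBenchmark's actual data structures.

```python
# Sketch: expanding a "copy" basic scenario to a chosen scale, in the
# spirit of STBenchmark's configuration-parameter-driven generation.

def expand_copy_scenario(nesting_depth, attrs_per_element):
    """Build a synthetic nested source schema; in a copy scenario the
    target is an identical schema and the expected mapping is the identity."""
    def build(depth):
        element = {f"attr{i}": "string" for i in range(attrs_per_element)}
        if depth > 1:
            element["child"] = build(depth - 1)
        return element

    def paths(element, prefix=""):
        for name, value in element.items():
            path = f"{prefix}/{name}"
            if isinstance(value, dict):
                yield from paths(value, path)
            else:
                yield path

    source = build(nesting_depth)
    target = build(nesting_depth)               # copy: target mirrors source
    mapping = [(p, p) for p in paths(source)]   # identity correspondences
    return source, target, mapping
```

Concatenating several such expanded scenarios, each with its own parameter settings, then yields arbitrarily large synthetic mapping scenarios whose expected mappings remain known.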