Uncertainty in Data Integration and Dataspace Support Platforms - Schema Matching and Mapping

Theorem 9. There exists a source schema $S$, a p-med-schema $\mathbf{M}$, a set of one-to-one p-mappings $\mathbf{pM}$ between $S$ and possible mediated schemas in $\mathbf{M}$, and an instance $D$ of $S$, such that for any deterministic mediated schema $T$ and any one-to-one p-mapping $pM$ between $S$ and $T$, there exists a query $Q$ such that $Q_{\mathbf{M},\mathbf{pM}}(D) \neq Q_{T,pM}(D)$. □

Constructing one-to-many p-mappings in practice is much harder than constructing one-to-one p-mappings. When we are restricted to one-to-one p-mappings, p-med-schemas grant us more expressive power while keeping the process of mapping generation feasible.

4.3 P-med-Schema Creation

We now show how to create a probabilistic mediated schema $\mathbf{M}$. Given source tables $S_1, \ldots, S_n$, we first construct the multiple schemas $M_1, \ldots, M_p$ in $\mathbf{M}$, and then assign each of them a probability.

We exploit two pieces of information available in the source tables: (1) pairwise similarity of source attributes, and (2) statistical co-occurrence properties of source attributes. The former is used for creating multiple mediated schemas and the latter for assigning probabilities to each of the mediated schemas.

The first piece of information tells us when two attributes are likely to be similar and is generated by a collection of schema matching modules. This information is typically given by some pairwise attribute similarity measure, say $s$. The similarity $s(a_i, a_j)$ between two source attributes $a_i$ and $a_j$ indicates how closely the two attributes represent the same real-world concept.
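As a rough illustration of such a measure, the sketch below computes $s(a_i, a_j)$ for every pair of attribute names. It assumes plain string similarity from Python's standard `difflib` module as a stand-in for the output of the schema matching modules, which the text does not specify; the function names `similarity` and `pairwise_similarities` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for s(a_i, a_j): a string similarity in [0, 1].

    A real system would combine several schema matching modules; plain
    character-level similarity is used here only for illustration.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pairwise_similarities(attributes):
    """Return {(a_i, a_j): s(a_i, a_j)} for every unordered pair of distinct names."""
    names = sorted(set(attributes))
    return {(a, b): similarity(a, b) for a, b in combinations(names, 2)}


sims = pairwise_similarities(["name", "address", "email-address", "home-address"])
for pair, score in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

With this particular stand-in, address scores high against both email-address and home-address, so string similarity alone cannot tell which of the two it corresponds to.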

The second piece of information tells us when two attributes are likely to be different. Consider, for example, the source table schemas

S1: (name,address,email-address)

S2: (name,home-address)

Pairwise string similarity would indicate that attribute address can be similar to both email-address and home-address. However, since the first source table contains address and email-address together, they cannot refer to the same concept. Hence, the first table suggests address is different from email-address, making it more likely that address refers to home-address.
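The reasoning in this example can be sketched in a few lines. The code below is a simplification under stated assumptions: it treats co-occurrence in a single table as a hard exclusion, whereas the text uses co-occurrence statistics to assign probabilities, and the helper names `cooccurring_pairs` and `may_be_same_concept` are hypothetical.

```python
from itertools import combinations


def cooccurring_pairs(tables):
    """Collect attribute pairs that appear together in at least one source table."""
    pairs = set()
    for attrs in tables.values():
        # Sort so that each unordered pair has one canonical representation.
        for a, b in combinations(sorted(set(attrs)), 2):
            pairs.add((a, b))
    return pairs


def may_be_same_concept(a, b, together):
    """Two attributes found in the same table are assumed to be distinct concepts."""
    return tuple(sorted((a, b))) not in together


# The two source table schemas from the example above.
tables = {
    "S1": ["name", "address", "email-address"],
    "S2": ["name", "home-address"],
}
together = cooccurring_pairs(tables)

print(may_be_same_concept("address", "email-address", together))  # False
print(may_be_same_concept("address", "home-address", together))   # True
```

Only the pairing of address with home-address survives the check, which matches the conclusion drawn from table S1.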

Creating multiple mediated schemas: The creation of the multiple mediated schemas constituting the p-med-schema can be divided conceptually into three steps. First, we remove infrequent attributes from the set of all source attributes, that is, attribute names that do not appear in a large fraction of source tables. This step ensures that our mediated schema contains only information that is relevant and central to the domain. In the second step, we construct a weighted graph whose nodes are the attributes that survived the filter of the first step. An edge in the graph is labeled
