Uncertainty in Data Integration and Dataspace Support Platforms - Schema Matching and Mapping


In this section, we first motivate the need for probabilistic mediated schemas (p-med-schemas) with an example (Sect. 4.1). In Sect. 4.2, we formally define p-med-schemas and relate them with p-mappings in terms of expressive power and semantics of query answering. Then, in Sect. 4.3, we describe an algorithm for creating a p-med-schema from a set of data sources. Finally, Sect. 4.4 gives an algorithm for consolidating a p-med-schema into a single schema that is visible to the user in a pay-as-you-go system.

4.1 P-med-Schema Motivating Example

Let us begin with an example motivating p-med-schemas. Consider a setting in which we are trying to automatically infer a mediated schema from a set of data sources, where each of the sources is a single relational table. In this context, the mediated schema can be thought of as a “clustering” of source attributes, with similar attributes being grouped into the same cluster. The quality of query answers critically depends on the quality of this clustering. Because of the heterogeneity of the data sources being integrated, one is typically unsure of the semantics of the source attributes and in turn of the clustering.

Example 9. Consider two source schemas both describing people:

S1(name, hPhone, hAddr, oPhone, oAddr)

S2(name, phone, address)

In S2, the attribute phone can either be a home phone number or be an office phone number. Similarly, address can either be a home address or be an office address.

Suppose we cluster the attributes of S1 and S2. There are multiple ways to cluster the attributes, and they correspond to different mediated schemas. Below we list a few of them:

M1({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})

M2({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})

M3({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})

M4({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})

M5({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})

None of the listed mediated schemas is perfect. Schema M1 groups multiple attributes from S1. M2 seems inconsistent because phone is grouped with hPhone while address is grouped with oAddr. Schemas M3, M4, and M5 are partially correct, but none of them captures the fact that phone and address can be either home phone and home address, or office phone and office address.
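To make the clustering view concrete, the following sketch encodes the five candidate schemas as sets of attribute clusters and checks the observations above mechanically. It is only an illustration under stated assumptions: the helper names (`schema`, `is_partition`, `merges_within_source`, `reading`) and the `KIND` table that labels the attributes of S1 as home or office are hypothetical and do not come from the original text.

```python
# Source schemas from Example 9.
S1 = {"name", "hPhone", "hAddr", "oPhone", "oAddr"}
S2 = {"name", "phone", "address"}

# Assumption for illustration: which S1 attributes are "home" vs "office".
KIND = {"hPhone": "home", "hAddr": "home", "oPhone": "office", "oAddr": "office"}


def schema(*clusters):
    """Build a mediated schema as a set of attribute clusters."""
    return frozenset(frozenset(c) for c in clusters)


# The five clusterings listed in the text.
M = {
    "M1": schema({"name"}, {"phone", "hPhone", "oPhone"},
                 {"address", "hAddr", "oAddr"}),
    "M2": schema({"name"}, {"phone", "hPhone"}, {"oPhone"},
                 {"address", "oAddr"}, {"hAddr"}),
    "M3": schema({"name"}, {"phone", "hPhone"}, {"oPhone"},
                 {"address", "hAddr"}, {"oAddr"}),
    "M4": schema({"name"}, {"phone", "oPhone"}, {"hPhone"},
                 {"address", "oAddr"}, {"hAddr"}),
    "M5": schema({"name"}, {"phone"}, {"hPhone"}, {"oPhone"},
                 {"address"}, {"hAddr"}, {"oAddr"}),
}


def is_partition(clusters, attributes):
    """True if the clusters are disjoint and cover every attribute."""
    seen = [a for c in clusters for a in c]
    return len(seen) == len(set(seen)) and set(seen) == set(attributes)


def merges_within_source(clusters, source):
    """Clusters that contain two or more attributes of the same source."""
    return [c for c in clusters if len(c & source) > 1]


def reading(clusters, attr):
    """Home/office labels of the S1 attributes clustered with attr."""
    for c in clusters:
        if attr in c:
            return {KIND[a] for a in c if a != attr and a in KIND}
    return set()


if __name__ == "__main__":
    for label in sorted(M):
        clusters = M[label]
        print(
            label,
            "partition:", is_partition(clusters, S1 | S2),
            "| clusters merging S1 attributes:",
            len(merges_within_source(clusters, S1)),
            "| phone:", sorted(reading(clusters, "phone")),
            "| address:", sorted(reading(clusters, "address")),
        )
```

Under this encoding, M1 is the only candidate with clusters that merge attributes of S1, M2 is the only one that reads phone as home and address as office, and M5 leaves phone and address unmatched. No single clustering expresses both the home reading and the office reading at once.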

Even if we introduce probabilistic schema mappings, none of the listed mediated schemas will return ideal answers. For example, using M1 prohibits returning correct answers for queries that contain both hPhone and oPhone because they are
