In this section, we first motivate the need for probabilistic mediated schemas
(p-med-schemas) with an example (Sect. 4.1). In Sect. 4.2, we formally define
p-med-schemas and relate them to p-mappings in terms of expressive power and
semantics of query answering. Then, in Sect. 4.3, we describe an algorithm for
creating a p-med-schema from a set of data sources. Finally, Sect. 4.4 gives an
algorithm for consolidating a p-med-schema into a single schema that is visible
to the user in a pay-as-you-go system.
4.1 P-med-Schema Motivating Example
Let us begin with an example motivating p-med-schemas. Consider a setting in
which we are trying to automatically infer a mediated schema from a set of data
sources, where each of the sources is a single relational table. In this context, the
mediated schema can be thought of as a “clustering” of source attributes, with similar
attributes being grouped into the same cluster. The quality of query answers
critically depends on the quality of this clustering. Because of the heterogeneity of
the data sources being integrated, one is typically unsure of the semantics of the
source attributes and in turn of the clustering.
Example 9. Consider two source schemas both describing people:
S1(name, hPhone, hAddr, oPhone, oAddr)
S2(name, phone, address)
In S2, the attribute phone can either be a home phone number or be an office
phone number. Similarly, address can either be a home address or be an office
address.
Suppose we cluster the attributes of S1 and S2. There are multiple ways to cluster
the attributes, and they correspond to different mediated schemas. Below we list a
few of them:
M1({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
M2({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
M3({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
M4({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
M5({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
None of the listed mediated schemas is perfect. Schema M1 groups multiple
attributes from S1. M2 seems inconsistent because phone is grouped with hPhone
while address is grouped with oAddr. Schemas M3, M4, and M5 are partially
correct, but none of them captures the fact that phone and address can be either
home phone and home address, or office phone and office address.
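To make these observations concrete, the following is a minimal sketch that encodes the five mediated schemas of Example 9 as clusterings of source attributes and checks the two problems noted above. The representation (sets of sets) and the helper names `schema`, `is_clustering`, `merged_within_source`, and `partners` are assumptions introduced here for illustration; they are not part of the formal definitions given later.

```python
# Illustrative sketch only; all helper names are hypothetical.
S1 = frozenset({"name", "hPhone", "hAddr", "oPhone", "oAddr"})
S2 = frozenset({"name", "phone", "address"})
ALL_ATTRS = S1 | S2


def schema(*clusters):
    """Build a mediated schema as a set of attribute clusters."""
    return frozenset(frozenset(c) for c in clusters)


M1 = schema({"name"}, {"phone", "hPhone", "oPhone"}, {"address", "hAddr", "oAddr"})
M2 = schema({"name"}, {"phone", "hPhone"}, {"oPhone"}, {"address", "oAddr"}, {"hAddr"})
M3 = schema({"name"}, {"phone", "hPhone"}, {"oPhone"}, {"address", "hAddr"}, {"oAddr"})
M4 = schema({"name"}, {"phone", "oPhone"}, {"hPhone"}, {"address", "oAddr"}, {"hAddr"})
M5 = schema({"name"}, {"phone"}, {"hPhone"}, {"oPhone"}, {"address"}, {"hAddr"}, {"oAddr"})


def is_clustering(med, attrs=ALL_ATTRS):
    """True if every source attribute falls in exactly one cluster."""
    placed = [a for cluster in med for a in cluster]
    return len(placed) == len(set(placed)) and set(placed) == set(attrs)


def merged_within_source(med, source):
    """Groups of attributes from a single source that share a cluster."""
    return sorted(sorted(c & source) for c in med if len(c & source) > 1)


def partners(med, attr):
    """Attributes that are clustered together with `attr`."""
    cluster = next(c for c in med if attr in c)
    return set(cluster) - {attr}


# M1 merges distinct S1 attributes (home and office) into one cluster.
print(merged_within_source(M1, S1))
# M2 pairs phone with a home attribute but address with an office attribute.
print(partners(M2, "phone"), partners(M2, "address"))
```

Under this encoding, M1 is the only listed schema for which `merged_within_source(..., S1)` is non-empty, and M2 is the one in which `phone` and `address` are paired with attributes of different kinds.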
Even if we introduce probabilistic schema mappings, none of the listed mediated
schemas will return ideal answers. For example, using M1 prohibits returning
correct answers for queries that contain both hPhone and oPhone because they are