Theorem 9. There exists a source schema S, a p-med-schema M, a set of one-
to-one p-mappings pM between S and possible mediated schemas in M, and an
instance D of S, such that for any deterministic mediated schema T and any
one-to-one p-mapping pM between S and T, there exists a query Q such that
$Q_{M,pM}(D) \neq Q_{T,pM}(D)$. □
Constructing one-to-many p-mappings in practice is much harder than constructing
one-to-one p-mappings. When we are restricted to one-to-one p-mappings,
p-med-schemas give us more expressive power while keeping the process of
mapping generation feasible.
4.3 P-med-Schema Creation
We now show how to create a probabilistic mediated schema M. Given source tables
$S_1, \ldots, S_n$, we first construct the multiple schemas $M_1, \ldots, M_p$
in M, and then assign each of them a probability.
We exploit two pieces of information available in the source tables: (1) pairwise
similarity of source attributes, and (2) statistical co-occurrence properties of source
attributes. The former is used for creating multiple mediated schemas and the latter
for assigning a probability to each of the mediated schemas.
The first piece of information tells us when two attributes are likely to be similar
and is generated by a collection of schema matching modules. This information is
typically given by some pairwise attribute similarity measure, say s. The
similarity $s(a_i, a_j)$ between two source attributes $a_i$ and $a_j$ depicts
how closely the two attributes represent the same real-world concept.
The second piece of information tells us when two attributes are likely to be
different. Consider, for example, the source table schemas
S1: (name,address,email-address)
S2: (name,home-address)
Pairwise string similarity would indicate that attribute address can be similar to
both email-address and home-address. However, since the first source table
contains address and email-address together, they cannot refer to the same concept.
Hence, the first table suggests address is different from email-address, making it
more likely that address refers to home-address.
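
The interplay of the two signals on this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not fix a particular similarity measure, so the standard library's difflib ratio stands in for s, and the function names and the 0.6 threshold are hypothetical choices made for this sketch.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Source schemas from the example above.
tables = {
    "S1": ["name", "address", "email-address"],
    "S2": ["name", "home-address"],
}

def similarity(a, b):
    """Illustrative stand-in for s(a_i, a_j): a string-similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def cooccurring_pairs(tables):
    """Return every unordered attribute pair that appears together in some table."""
    pairs = set()
    for attrs in tables.values():
        pairs.update(frozenset(p) for p in combinations(attrs, 2))
    return pairs

def plausible_matches(attr, tables, threshold=0.6):
    """Attributes similar to `attr` that never share a source table with it.

    The threshold is an assumed value chosen for this sketch only.
    """
    together = cooccurring_pairs(tables)
    others = {a for attrs in tables.values() for a in attrs} - {attr}
    return sorted(
        b for b in others
        if similarity(attr, b) >= threshold
        and frozenset((attr, b)) not in together
    )

print(plausible_matches("address", tables))  # prints ['home-address']
```

Both email-address and home-address pass the string-similarity test against address, but email-address is ruled out because it co-occurs with address in S1, leaving home-address as the only plausible match.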
Creating multiple mediated schemas: The creation of the multiple mediated schemas
constituting the p-med-schema can be divided conceptually into three steps. First,
we remove infrequent attributes from the set of all source attributes, that is, attribute
names that do not appear in a large fraction of source tables. This step ensures that
our mediated schema contains only information that is relevant and central to the
domain. In the second step, we construct a weighted graph whose nodes are the
attributes that survived the filter of the first step. An edge in the graph is labeled