Uncertainty in Data Integration and Dataspace Support Platforms - Schema Matching and Mapping

Databases Reference

In-Depth Information

Step 1: We generate the possible reformulations of Q (a reformulation query com-

putes all certain answers when executed on the source data) by considering every

combination of the form .m 1 ;:::;m l /,wherem i is one of the possible mappings

in pM i . Denote the set of reformulations by Q 0 1 ;:::;Q 0 k

. The probability of a

reformulation Q 0 D .m 1 ;:::;m l / is ˘ i D 1 Pr.m i /.

Step 2: For each reformulation Q 0 , retrieve each of the unique answers from the

sources. For each answer obtained by Q 0 1 [ ::: [ Q 0 k

, its probability is computed

by summing the probabilities of the Q 0 's in which it is returned.

Importantly, note that it is possible to express both steps as an SQL query with

grouping and aggregation. Therefore, if the underlying sources support SQL, we

can leverage their optimizations to compute the answers.

With our restricted form of schema mapping, the algorithm takes time poly-

nomial in the size of the data and the mappings. We, thus, have the following

complexity result.

Theorem 1. Let pM be a schem a p- mapping and let Q be an SPJ query.

Answering Q with respect to pM in by-table semantics is in PTIME in the size

of the data and the mapping.

t

3.3.2

By-tuple Query Answering

To extend the by-table query-answering strategy to by-tuple semantics, we would

need to compute the certain answers for every mapping sequence generated by pM.

However, the number of such mapping sequences is exponential in the size of the

input data. The following example shows that for certain queries, this exponential

time complexity is inevitable.

Example 4. Suppose that in addition to the tables in Example 1 ,wealsohave

U(city) in the source and V(hightech) in the target. The p-mapping for V contains

two possible mappings: (

,0.2).

Consider the following query Q, which decides if there are any people living in

a high-tech city.

f

(city, hightech)

g

,0.8)and(

;

Q: SELECT 'true'

FROM T, V

WHERE T.mailing-addr = V.hightech

An incorrect way of answering the query is to first execute the following two

subqueries Q 1

and Q 2

and then join the answers of Q 1

and Q 2 , and summing up

the probabilities.

Q1: SELECT mailing-addr FROM T

Q2: SELECT hightech FROM V

Now, consider the source instance D,whereD S isshowninFig. 4.2 a, and D U

has two tuples (“Mountain View”) and (“Sunnyvale”). Figure 4.4 a,b show Q tuple

1

.D/

Schema Matching and Mapping

Search WWH ::

Custom Search

Home