TheQuery Evaluation Problem - Probabilistic Databases

Database Reference

In-Depth Information

Note that

C(θ)

is a number between 0 and m . Then, we claim that:

P( ¬ H 1 | E θ ) = z | C(θ) |

(3.4)

Indeed, consider a world W that satisfies E θ . Since H 1 =

R(x 0 ), S(x 0 ,y 0 )

∨

S(x 1 ,y 1 ), T (y 1 ) ,we

have

¬ (W

H 1 ) iff both queries R(x),S(x,y) and S(x,y),T (y) are false on W . Consider a tuple

Y j , then either R W (X i ) is true, or T W (Y j ) is true (because

(X i ,Y j )

∈

S .If θ satisfies the clause X i ∨

¬ S W (X i ,Y j ) to ensure that both R W (X i ), S W (X i ,Y j ) and

S W (X i ,Y j ), T W (Y j ) are false; the probability of the event

E θ ), and therefore we must have

S W (X i ,Y j ) is z .If θ does not satisfy

the clause X i ∨ Y j , then both queries R(X i ), S(X i ,Y j ) and S(X i ,Y j ), T (Y j ) are false regardless of

whether S W

H 1 ) iff S W

contains the tuple (X i ,Y j ) or not. In other words,

¬ (W

does not contain

any tuple (X i ,Y j ) for which (i, j) ∈ C(θ) ; this proves Eq. (3.4) .

Finally, we compute P( ¬ H 1 ) . For any number c , 0 ≤ c ≤ m , let # c =

the number of valuations

θ that satisfy exactly c clauses, i.e., # c =|{ θ | c =| C(θ) |}|

. Then, Eq. (3.3) and Eq. (3.4) become:

2 n

c = 0 ,m

# c · z c

P( ¬ H 1 ) =

This is a polynomial in z of degree m , with coefficients #0 , #1 ,..., # m . In other words, and oracle

for P(

# m , because in # represents the

number of valuations that satisfy all m clauses. Therefore, we can compute # using an oracle for

P( ¬ H 1 ) as follows. Choose any m + 1 distinct values for z ∈ ( 0 , 1 ) , and construct m + 1 different

database instances R, S, T (they are isomorphic, and differ only in the probabilities in S , which

are set to 1 − z ). Then we call the oracle, and obtain the value of the polynomial at that point z .

From these m

H 1 ) computes the polynomial above. Note that #

1 values we will derive all the coefficients, e.g., by using Lagrange's polynomial

interpolation formula. The leading coefficient, # m , is precisely # .

As we will show in Chapter 4 , not all queries are hard; in fact, many queries can be evalu-

ated quite efficiently on tuple-independent, or on BID databases. Moreover, the list of queries H k ,

0 , 1 ,... is not even complete: there are many other queries that are also hard. For Unions of

Conjunctive Queries (UCQ), we know exactly which queries are hard, and this class includes all the

queries H k and others too; we will discuss this in Chapter 4 . For the full Relational Calculus (RC),

the class of hard queries is not known exactly (of course, it includes all UCQ queries that are hard).

By using the queries H k as primitives, it is quite easy to prove that some other queries are

hard. We illustrate this on the example below which is both interesting in its own right, and also

illustrates one key proof technique used in the general hardness proof [ Dalvi and Suciu , 2010 ].

Example 3.3

Consider the Boolean query

Q =∃ x. ∃ y. ∃ z.U(x,y),U(y,z)

We prove that it is hard for # P . The query checks for the presence of a path of length 2 in the graph

defined by the binary edge relation U . We will prove that it is hard even if we restrict the graph to a

Probabilistic Databases

Search WWH ::

Custom Search

Home