TheQuery Evaluation Problem - Probabilistic Databases

Database Reference

In-Depth Information

3.2

THE COMPLEXITY OF P (Q)

We now turn to the complexity of the query evaluation problem. Throughout this section we assume

that the query Q is a Boolean query, thus the goal is to compute P (Q) . This is without loss of

generality, since the probability of any possible answer

a to a non-Boolean query Q( x) reduces to

the probability of the Boolean query Q [ a/x ]

, more precisely, P(a ∈ Q) = P(Q [ a/x ] ) . We also

restrict the input probabilistic database to a tuple-independent database: if a query Q is hard on

tuple-independent databases, then it remains hard over more expressive probabilistic databases, as

long as these allow tuple-independent databases as a special case.

We are interested in the data complexity of the query evaluation problem: for a fixed query Q ,

what is the complexity as a function of the database D ? The answer will depend on the query: for

some queries, the complexity is in polynomial time, for other queries it is not. A query Q is called

tractable if its data complexity is in polynomial time; otherwise, the query Q is called intractable .

Recall that, over deterministic databases, for every query in the relational calculus the data complexity

is in polynomial time [ Va rd i , 1982 ], hence it is tractable according to our terminology.

In this section we prove that, for each of the queries below, the evaluation problem is hard for

#P:

H 0 = R(x),S(x,y),T(y)

H 1 = R(x 0 ), S(x 0 ,y 0 ) ∨ S(x 1 ,y 1 ), T (y 1 )

H 2 = R(x 0 ), S 1 (x 0 ,y 0 ) ∨ S 1 (x 1 ,y 1 ), S 2 (x 1 ,y 1 ) ∨ S 2 (x 2 ,y 2 ), T (y 2 )

H 3 = R(x 0 ), S 1 (x 0 ,y 0 ) ∨ S 1 (x 1 ,y 1 ), S 2 (x 1 ,y 1 ) ∨ S 2 (x 2 ,y 2 ), S 3 (x 2 ,y 2 ) ∨ S 3 (x 3 ,y 3 ), T (y 3 )

...

Each query is a Boolean query, but we have dropped the quantifiers for conciseness; that is, H 0 =

∃ x. ∃ y.R(x),S(x,y),T (y) , etc. For each query H k , we are interested in evaluating P(H k ) on a tuple-

independent probabilistic database D

= ( R, S 1 ,...,S k ,T ,P) , and measure the complexity as a

function of the size of D (that is, the query is fixed).

For ever y k

≥

0, the data complexity of the query H k is hard for # P .

Theorem 3.2

Proof. We give two separate proofs, one for H 0 and one for H 1 ; the proof for H k , for k ≥ 2 is a

non-trivial extension of that for H 1 and is omitted; it can be found in [ Dalvi and Suciu , 2010 ].

The proof for H 0 is by reduction from #PP2DNF ( Theorem 3.1 ). Consider any formula

(i,j)

X i Y j

(3.1)

∈

and construct the following probabilistic database instance D

= ( R, S, T ,P) , where: R =

{ X 1 ,X 2 ,... }

, T

={ Y 1 ,Y 2 ,... }

, S ={ (X i ,Y j ) | (i, j) ∈ E }

, and the probability function is de-

fined as follows: P(R(X i )) = P(T(Y j )) =

1 / 2, P(S(X i ,Y j )) =

1. Every possible world is of the

Probabilistic Databases

Search WWH ::

Custom Search

Home