TheQuery Evaluation Problem - Probabilistic Databases

Database Reference

In-Depth Information

3. THE QUERY EVALUATION PROBLEM

whose running time is exponential in the number of Boolean variables, because one would have to

iterate over all 2 n assignments θ , check if they satisfy , then add the probabilities of those that do.

Typically, n is the number of records in the database, hence this approach is prohibitive. It turns out

that there is, essentially, no better way of computing P () in general: this problem is provably hard.

In order to state this formally, we introduce two problems.

Model Counting Problem Given a propositional formula , count the number of satisfying as-

signments # , i.e., compute # =| ω() |

.

Probability Computation Problem Given a propositional formula and a probability P(X) ∈

[ 0 , 1 ]

for each Boolean variable X , compute the probability P () = θ ∈ ω() P(θ) .

Model counting is a special case of probability computation, because any algorithm for com-

puting P () can be used to compute # . Define P(X) = 1 / 2 for every variable X : then P(θ) =

1 / 2 n

2 n .

A classical result, which we review here, is that the model counting problem is hard, even if is

restricted to a simple class of propositional formulas, as discussed next: this implies immediately that

the probability computation problem is also hard.

Recall that SAT is NP-complete, where SAT is the satisfiability problem: “given , check

if is satisfiable”. The decision problem 3SAT, where is restricted to a 3CNF formula, is also

NP-complete, but the problem 2SAT is in polynomial time.

The complexity class #P was introduced by Valiant [ 1979 ] and consists of all function problems

of the following type: given a polynomial-time, non-deterministic Turing machine, compute the

number of accepting computations. The model counting problem, “given , compute # ”, is also

denoted #SAT, and is obviously in #P. Note that any algorithm solving #SAT can be used to solve

SAT, by simply using the former to obtain # and then testing whether # > 0.

Valiant proved that #SAT is hard for #P. He also showed that computing # remains hard for

#P even if is restricted to be in P2CNF, the class of positive 2CNF formulas, i.e., formulas where

each clause consists of two positive literals, X i ∨ X j . It follows immediately that #SAT remains

hard for #P even for P2DNF formulas, i.e., formulas that are disjunctions of conjuncts of the form

X i X j . This has been further strengthened by Provan and Ball [ 1983 ]; they proved the following

result, which is the most important hardness result used in probabilistic databases:

for every assignment θ , where n is the number of variables, and therefore #

=

P ()

·

Theorem 3.1

Let X 1 ,X 2 ,... and Y 1 ,Y 2 ,... be two disjoint sets of Boolean variables.

A Positive, Partitioned 2-DNF propositional formula is a DNF formula of the form:

=

(i,j)

X i Y j

∈

E

The #PP2DNF problem is “given a PP2DNF formula , compute # ”.

Probabilistic Databases

Search WWH ::

Custom Search

Home