Database Reference
In-Depth Information
3. THE QUERY EVALUATION PROBLEM
whose running time is exponential in the number of Boolean variables, because one would have to
iterate over all 2
n
assignments
θ
, check if they satisfy
, then add the probabilities of those that do.
Typically,
n
is the number of records in the database, hence this approach is prohibitive. It turns out
that there is, essentially, no better way of computing
P ()
in general: this problem is provably hard.
In order to state this formally, we introduce two problems.
Model Counting Problem
Given a propositional formula
, count the number of satisfying as-
signments #
, i.e., compute #
=|
ω()
|
.
Probability Computation Problem
Given a propositional formula
and a probability
P(X)
∈
[
0
,
1
]
for each Boolean variable
X
, compute the probability
P ()
=
θ
∈
ω()
P(θ)
.
Model counting is a special case of probability computation, because any algorithm for com-
puting
P ()
can be used to compute #
. Define
P(X)
=
1
/
2 for every variable
X
: then
P(θ)
=
1
/
2
n
2
n
.
A classical result, which we review here, is that the model counting problem is hard, even if
is
restricted to a simple class of propositional formulas, as discussed next: this implies immediately that
the probability computation problem is also hard.
Recall that SAT is NP-complete, where SAT is the satisfiability problem: “given
, check
if
is satisfiable”. The decision problem 3SAT, where
is restricted to a 3CNF formula, is also
NP-complete, but the problem 2SAT is in polynomial time.
The complexity class #P was introduced by
Valiant
[
1979
] and consists of all function problems
of the following type: given a polynomial-time, non-deterministic Turing machine, compute the
number of accepting computations. The model counting problem, “given
, compute #
”, is also
denoted #SAT, and is obviously in #P. Note that any algorithm solving #SAT can be used to solve
SAT, by simply using the former to obtain #
and then testing whether #
>
0.
Valiant proved that #SAT is hard for #P. He also showed that computing #
remains hard for
#P even if
is restricted to be in P2CNF, the class of positive 2CNF formulas, i.e., formulas where
each clause consists of two positive literals,
X
i
∨
X
j
. It follows immediately that #SAT remains
hard for #P even for P2DNF formulas, i.e., formulas that are disjunctions of conjuncts of the form
X
i
X
j
. This has been further strengthened by
Provan and Ball
[
1983
]; they proved the following
result, which is the most important hardness result used in probabilistic databases:
for every assignment
θ
, where
n
is the number of variables, and therefore #
=
P ()
·
Theorem 3.1
Let
X
1
,X
2
,...
and
Y
1
,Y
2
,...
be two disjoint sets of Boolean variables.
A
Positive, Partitioned 2-DNF
propositional formula is a DNF formula of the form:
=
(i,j)
X
i
Y
j
∈
E
The #PP2DNF problem is “given a PP2DNF formula
, compute #
”.