Database Reference
In-Depth Information
3.2
THE COMPLEXITY OF
P (Q)
We now turn to the complexity of the query evaluation problem. Throughout this section we assume
that the query
Q
is a Boolean query, thus the goal is to compute
P (Q)
. This is without loss of
generality, since the probability of any possible answer
a
to a non-Boolean query
Q( x)
reduces to
the probability of the Boolean query
Q
[
a/x
]
, more precisely,
P(a
∈
Q)
=
P(Q
[
a/x
]
)
. We also
restrict the input probabilistic database to a tuple-independent database: if a query
Q
is hard on
tuple-independent databases, then it remains hard over more expressive probabilistic databases, as
long as these allow tuple-independent databases as a special case.
We are interested in the
data complexity
of the query evaluation problem: for a fixed query
Q
,
what is the complexity as a function of the database
D
? The answer will depend on the query: for
some queries, the complexity is in polynomial time, for other queries it is not. A query
Q
is called
tractable
if its data complexity is in polynomial time; otherwise, the query
Q
is called
intractable
.
Recall that, over deterministic databases, for every query in the relational calculus the data complexity
is in polynomial time [
Va rd i
,
1982
], hence it is tractable according to our terminology.
In this section we prove that, for each of the queries below, the evaluation problem is hard for
#P:
H
0
=
R(x),S(x,y),T(y)
H
1
=
R(x
0
), S(x
0
,y
0
)
∨
S(x
1
,y
1
), T (y
1
)
H
2
=
R(x
0
), S
1
(x
0
,y
0
)
∨
S
1
(x
1
,y
1
), S
2
(x
1
,y
1
)
∨
S
2
(x
2
,y
2
), T (y
2
)
H
3
=
R(x
0
), S
1
(x
0
,y
0
)
∨
S
1
(x
1
,y
1
), S
2
(x
1
,y
1
)
∨
S
2
(x
2
,y
2
), S
3
(x
2
,y
2
)
∨
S
3
(x
3
,y
3
), T (y
3
)
...
Each query is a Boolean query, but we have dropped the quantifiers for conciseness; that is,
H
0
=
∃
x.
∃
y.R(x),S(x,y),T (y)
, etc. For each query
H
k
, we are interested in evaluating
P(H
k
)
on a tuple-
independent probabilistic database
D
=
(
R, S
1
,...,S
k
,T
,P)
, and measure the complexity as a
function of the size of
D
(that is, the query is fixed).
For ever y
k
≥
0, the data complexity of the query
H
k
is hard for #
P
.
Theorem 3.2
Proof.
We give two separate proofs, one for
H
0
and one for
H
1
; the proof for
H
k
, for
k
≥
2 is a
non-trivial extension of that for
H
1
and is omitted; it can be found in [
Dalvi and Suciu
,
2010
].
The proof for
H
0
is by reduction from #PP2DNF (
Theorem 3.1
). Consider any formula
=
(i,j)
X
i
Y
j
(3.1)
∈
E
and construct the following probabilistic database instance
D
=
(
R, S, T
,P)
, where:
R
=
{
X
1
,X
2
,...
}
,
T
={
Y
1
,Y
2
,...
}
,
S
={
(X
i
,Y
j
)
|
(i, j)
∈
E
}
, and the probability function is de-
fined as follows:
P(R(X
i
))
=
P(T(Y
j
))
=
1
/
2,
P(S(X
i
,Y
j
))
=
1. Every possible world is of the