The expected value of Z_k is

$$
\begin{aligned}
E[Z_k] &= \sum_{i} \; \sum_{\theta \,\models\, \omega(\varphi_i)} \frac{P(\varphi_i)}{M} \cdot \frac{P(\theta)}{P(\varphi_i)} \cdot \frac{1}{|\{\varphi_j \mid \theta \models \omega(\varphi_j)\}|} \\
&= \sum_{\theta \,:\, \exists \varphi_i\; \theta \models \omega(\varphi_i)} \frac{P(\theta) \cdot |\{\varphi_i \mid \theta \models \omega(\varphi_i)\}|}{M \cdot |\{\varphi_j \mid \theta \models \omega(\varphi_j)\}|} \\
&= \frac{1}{M} \cdot \sum_{\theta \,:\, \exists \varphi_i\; \theta \models \omega(\varphi_i)} P(\theta) \\
&= \frac{P(\Phi)}{M} = \frac{p}{M},
\end{aligned}
$$
so Z_k is an unbiased estimator for p/M and E[Z] = N · p/M. We thus approximate p by p̂ = Z · M/N, whose expected value is E[Z] · M/N = p.
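To make the sampling procedure concrete, the following Python sketch implements the Karp-Luby estimator described above, under the assumption that each conjunct is given as a set of literals over independent Boolean variables; all identifiers (clause_prob, satisfies, karp_luby) are our own illustration, not from the text.

```python
import random

# A minimal sketch of the Karp-Luby estimator for a DNF over independent
# Boolean random variables. A conjunct phi_i is a dict {var: required_value};
# probs[var] is P(var = True); all variables are independent.

def clause_prob(clause, probs):
    """P(phi_i): the probability that every literal of the conjunct holds."""
    p = 1.0
    for var, val in clause.items():
        p *= probs[var] if val else 1.0 - probs[var]
    return p

def satisfies(theta, clause):
    """True iff the assignment theta satisfies the conjunct."""
    return all(theta[var] == val for var, val in clause.items())

def karp_luby(clauses, probs, N):
    """Return the estimate Z * M / N of p = P(Phi) after N trials."""
    weights = [clause_prob(c, probs) for c in clauses]
    M = sum(weights)  # M = sum_i P(phi_i)
    Z = 0.0
    for _ in range(N):
        # Pick conjunct i with probability P(phi_i) / M.
        i = random.choices(range(len(clauses)), weights=weights)[0]
        # Sample theta conditioned on satisfying phi_i: fix the literals
        # of phi_i and draw every other variable independently.
        theta = {
            var: clauses[i][var] if var in clauses[i]
            else random.random() < probs[var]
            for var in probs
        }
        # Z_k = 1 / |{ phi_j : theta satisfies phi_j }| (at least phi_i does).
        Z += 1.0 / sum(1 for c in clauses if satisfies(theta, c))
    return Z * M / N

# Example: Phi = (x AND y) OR (NOT y AND z) with all marginals 1/2; the two
# conjuncts are disjoint on y, so the true probability is 0.25 + 0.25 = 0.5.
clauses = [{"x": True, "y": True}, {"y": False, "z": True}]
probs = {"x": 0.5, "y": 0.5, "z": 0.5}
print(karp_luby(clauses, probs, N=10_000))  # should be close to 0.5
```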
Computing Z consists of summing up the outcomes of N Bernoulli trials. For such a scenario,
we can use the Chernoff bound
$$
\Pr\big[\, |Z - E[Z]| \ge \varepsilon \cdot E[Z] \,\big] \;\le\; 2 \cdot e^{-\varepsilon^2 \cdot E[Z] / 3}
$$
(cf., e.g., [ Mitzenmacher and Upfal , 2005 ], Eq. 4.6). By substitution, we get
$$
\Pr\big[\, |\hat{p} - p| \ge \varepsilon \cdot p \,\big] \;=\; \Pr\Big[\, \frac{N}{M} \cdot |\hat{p} - p| \ge \varepsilon \cdot \frac{N \cdot p}{M} \,\Big] \;\le\; 2 \cdot e^{-\frac{N \cdot p \cdot \varepsilon^2}{3 \cdot M}}
$$
and thus, since p/M ≥ 1/n,
$$
\Pr\big[\, |\hat{p} - p| \ge \varepsilon \cdot p \,\big] \;\le\; 2 \cdot e^{-\frac{N \cdot \varepsilon^2}{3 \cdot n}} \;=\; \delta .
$$
By choosing

$$
N \;=\; \frac{3}{\varepsilon^2} \cdot n \cdot \log\frac{2}{\delta} \tag{5.9}
$$
we get an (ε, δ) fully polynomial-time randomized approximation scheme (FPTRAS) for computing
the probability of a DNF over independent Boolean random variables.
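As a quick sanity check on Eq. (5.9), the sketch below computes the number of trials N for concrete, hypothetical parameter values; the helper name sample_size is our own.

```python
import math

def sample_size(eps, delta, n):
    """N from Eq. (5.9): (3 / eps^2) * n * log(2 / delta), rounded up."""
    return math.ceil(3.0 / eps**2 * n * math.log(2.0 / delta))

# For a DNF with n = 100 conjuncts, relative error eps = 0.1 guaranteed with
# failure probability delta = 0.01 requires about 158,950 trials.
print(sample_size(eps=0.1, delta=0.01, n=100))  # -> 158950
```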
Thus, we have two algorithms, the naïve Monte Carlo and Karp-Luby, for approximating the probability of a DNF expression. Whether one is preferable to the other depends on whether queries typically compute only large probabilities or not. Consider the two bounds on N for the naïve and the Karp-Luby algorithm, given by Eq. (5.8) and Eq. (5.9), respectively. The first is of the form N = C(ε, δ)/p while the second is of the form N = C(ε, δ) · n, where C(ε, δ) is O(log(2/δ)/ε²). Recall that p = P(Φ) is the probability of the formula, while n is the number of conjuncts in Φ. This suggests a trade-off between the two algorithms; the naïve MC is preferable when p ≥ 1/n, while Karp-Luby is preferable for smaller probabilities. A numeric comparison follows below.
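To see the trade-off in numbers, the short sketch below (our own illustration, with hypothetical parameter values) evaluates both bounds with the same constant C(ε, δ) = 3 · log(2/δ)/ε².

```python
import math

def C(eps, delta):
    # The shared constant C(eps, delta) = 3 * log(2/delta) / eps^2.
    return 3.0 * math.log(2.0 / delta) / eps**2

eps, delta, n = 0.1, 0.01, 100  # hypothetical values for illustration

for p in (0.5, 1e-6):
    naive = C(eps, delta) / p   # Eq. (5.8): naive Monte Carlo bound
    kl = C(eps, delta) * n      # Eq. (5.9): Karp-Luby bound
    print(f"p = {p:g}: naive ~ {naive:,.0f}, Karp-Luby ~ {kl:,.0f}")

# p = 0.5 : the naive bound (~3,179 trials) beats Karp-Luby's (~158,950);
# p = 1e-6: naive needs ~1.6e9 trials, so Karp-Luby wins decisively.
```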
 