Advanced Techniques - Probabilistic Databases

Database Reference

In-Depth Information

6. ADVANCED TECHNIQUES

process, like Monte Carlo simulation. We show how to concentrate most of the iterations on the

top k probabilities and compute for the others only a very coarse approximation, just the necessary

approximation to ensure that they are not in the top k .

Consider the following problem. There are n possible tuples, t 1 ,t 2 ,...,t n , each has a prob-

ability, p 1 ,...,p n , and our task is to rank the tuples in decreasing order of their probabilities and

return only the top k tuples: in other words, compute Top k = (t i 1 ,t i 2 ,...,t i k ) , where i 1 ,i 2 ,...,i n

is a permutation such that p i 1 ≥ p i 2 ≥ ... ≥ p i n . Assume that we have no easy method to compute

these probabilities exactly, but, instead, we have a lower bound and an upper bound, p i ∈[ L i ,U i ]

for i = 1 ,n . These bounds may be obtained by one of the approximation methods in Section 5.3 .

Through an iterative process we can improve each interval, i.e., at each iteration we narrow the

bound to

[ L i ,U i ]⊂[ L i ,U i ]

: we call such an iteration a simulation step ; in other words, a simulation

step for the tuple t i improves its bound from

L i ,U i ]

. The Multisimulation problem is

to decide in which order to apply the simulation steps to all the intervals

[

L i ,U i ]

[

[ L 1 ,U 1 ] ,..., [ L n ,U n ]

,in

order to quickly find Top k . This is a catch-22 problem: if we knew the set Top k , then we could focus

all simulation steps only on the k tuples in this set, but in order to compute this set, we need to know

the probabilities of all n tuples. In general, n is large (say, 1000 or more) while k is small (say, 10).

To simplify our discussion, we will assume that the probabilities p 1 ,...,p n are distinct,

p i = p j for i = j , which implies that Top k

is uniquely defined. We discuss below how to relax this

assumption.

In order to be able to make incremental progress on each interval

, we need to maintain

some additional data structures for each tuple t i , which we denote generically by G i , i

[ L i ,U i ]

1 ,n .For

example, consider the two approximation algorithms discussed in Section 5.3 .For Algorithm 3 ,we

need to maintain for each tuple t i the currently expanded circuit for its lineage expression and the

lower/upper bound for each gate in the circuit; for the Monte Carlo approximation algorithms in

Subsection 5.3.2 , we need to maintain the total number of trials N and the number of successful

trials c , for each t i .

We split the multisimulation problem into two parts. First, compute the setTop k ={ t 1 ,...,t k }

then rank this set.

6.1.1 COMPUTING THE SET Top k

Given a n positive numbers a 1 ,...,a n denote max k (a 1 ,...,a n ) the k 's largest number in the

sequence. In other words, given any ordered permutation a i 1 ≥ a i 2 ≥ ... ≥ a i n , max k (A) = a i k .If

k>n then, by convention, max k (a 1 ,...,a n ) =

0. The critical region for a set of n intervals

[ L i ,U i ]

is the interval (L c ,U c ) where:

L c

U c

max

(L 1 ,...,L n )

max

k + 1 (U 1 ,...,U n )

Figure 6.1 illustrates the critical region for the case when n = 5 and k = 2. The critical region

plays a key role in guiding the algorithm that simulates the tuples t 1 ,...,t n in an optimal order.

Probabilistic Databases

Search WWH ::

Custom Search

Home