Database Reference
In-Depth Information
6. ADVANCED TECHNIQUES
process, like Monte Carlo simulation. We show how to concentrate most of the iterations on the
top k probabilities and compute for the others only a very coarse approximation, just the necessary
approximation to ensure that they are not in the top k .
Consider the following problem. There are n possible tuples, t 1 ,t 2 ,...,t n , each has a prob-
ability, p 1 ,...,p n , and our task is to rank the tuples in decreasing order of their probabilities and
return only the top k tuples: in other words, compute Top k = (t i 1 ,t i 2 ,...,t i k ) , where i 1 ,i 2 ,...,i n
is a permutation such that p i 1 p i 2 ... p i n . Assume that we have no easy method to compute
these probabilities exactly, but, instead, we have a lower bound and an upper bound, p i ∈[ L i ,U i ]
,
for i = 1 ,n . These bounds may be obtained by one of the approximation methods in Section 5.3 .
Through an iterative process we can improve each interval, i.e., at each iteration we narrow the
bound to
[ L i ,U i ]⊂[ L i ,U i ]
: we call such an iteration a simulation step ; in other words, a simulation
step for the tuple t i improves its bound from
L i ,U i ]
. The Multisimulation problem is
to decide in which order to apply the simulation steps to all the intervals
[
L i ,U i ]
to
[
[ L 1 ,U 1 ] ,..., [ L n ,U n ]
,in
order to quickly find Top k . This is a catch-22 problem: if we knew the set Top k , then we could focus
all simulation steps only on the k tuples in this set, but in order to compute this set, we need to know
the probabilities of all n tuples. In general, n is large (say, 1000 or more) while k is small (say, 10).
To simplify our discussion, we will assume that the probabilities p 1 ,...,p n are distinct,
p i = p j for i = j , which implies that Top k
is uniquely defined. We discuss below how to relax this
assumption.
In order to be able to make incremental progress on each interval
, we need to maintain
some additional data structures for each tuple t i , which we denote generically by G i , i
[ L i ,U i ]
1 ,n .For
example, consider the two approximation algorithms discussed in Section 5.3 .For Algorithm 3 ,we
need to maintain for each tuple t i the currently expanded circuit for its lineage expression and the
lower/upper bound for each gate in the circuit; for the Monte Carlo approximation algorithms in
Subsection 5.3.2 , we need to maintain the total number of trials N and the number of successful
trials c , for each t i .
We split the multisimulation problem into two parts. First, compute the setTop k ={ t 1 ,...,t k }
=
,
then rank this set.
6.1.1 COMPUTING THE SET Top k
Given a n positive numbers a 1 ,...,a n denote max k (a 1 ,...,a n ) the k 's largest number in the
sequence. In other words, given any ordered permutation a i 1 a i 2 ... a i n , max k (A) = a i k .If
k>n then, by convention, max k (a 1 ,...,a n ) =
0. The critical region for a set of n intervals
[ L i ,U i ]
is the interval (L c ,U c ) where:
L c
U c
=
max
k
(L 1 ,...,L n )
=
max
k + 1 (U 1 ,...,U n )
Figure 6.1 illustrates the critical region for the case when n = 5 and k = 2. The critical region
plays a key role in guiding the algorithm that simulates the tuples t 1 ,...,t n in an optimal order.
Search WWH ::




Custom Search