Database Reference
In-Depth Information
6. ADVANCED TECHNIQUES
process, like Monte Carlo simulation. We show how to concentrate most of the iterations on the
top
k
probabilities and compute for the others only a very coarse approximation, just the necessary
approximation to ensure that they are not in the top
k
.
Consider the following problem. There are
n
possible tuples,
t
1
,t
2
,...,t
n
, each has a prob-
ability,
p
1
,...,p
n
, and our task is to rank the tuples in decreasing order of their probabilities and
return only the top
k
tuples: in other words, compute
Top
k
=
(t
i
1
,t
i
2
,...,t
i
k
)
, where
i
1
,i
2
,...,i
n
is a permutation such that
p
i
1
≥
p
i
2
≥
...
≥
p
i
n
. Assume that we have no easy method to compute
these probabilities exactly, but, instead, we have a lower bound and an upper bound,
p
i
∈[
L
i
,U
i
]
,
for
i
=
1
,n
. These bounds may be obtained by one of the approximation methods in
Section 5.3
.
Through an iterative process we can improve each interval, i.e., at each iteration we narrow the
bound to
[
L
i
,U
i
]⊂[
L
i
,U
i
]
: we call such an iteration a
simulation step
; in other words, a simulation
step for the tuple
t
i
improves its bound from
L
i
,U
i
]
. The
Multisimulation
problem is
to decide in which order to apply the simulation steps to all the intervals
[
L
i
,U
i
]
to
[
[
L
1
,U
1
]
,...,
[
L
n
,U
n
]
,in
order to quickly find
Top
k
. This is a catch-22 problem: if we knew the set
Top
k
, then we could focus
all simulation steps only on the
k
tuples in this set, but in order to compute this set, we need to know
the probabilities of all
n
tuples. In general,
n
is large (say, 1000 or more) while
k
is small (say, 10).
To simplify our discussion, we will assume that the probabilities
p
1
,...,p
n
are distinct,
p
i
=
p
j
for
i
=
j
, which implies that
Top
k
is uniquely defined. We discuss below how to relax this
assumption.
In order to be able to make incremental progress on each interval
, we need to maintain
some additional data structures for each tuple
t
i
, which we denote generically by
G
i
,
i
[
L
i
,U
i
]
1
,n
.For
example, consider the two approximation algorithms discussed in
Section 5.3
.For
Algorithm 3
,we
need to maintain for each tuple
t
i
the currently expanded circuit for its lineage expression and the
lower/upper bound for each gate in the circuit; for the Monte Carlo approximation algorithms in
Subsection 5.3.2
, we need to maintain the total number of trials
N
and the number of successful
trials
c
, for each
t
i
.
We split the multisimulation problem into two parts. First, compute the
setTop
k
={
t
1
,...,t
k
}
=
,
then rank this set.
6.1.1 COMPUTING THE SET
Top
k
Given a
n
positive numbers
a
1
,...,a
n
denote max
k
(a
1
,...,a
n
)
the
k
's largest number in the
sequence. In other words, given any ordered permutation
a
i
1
≥
a
i
2
≥
...
≥
a
i
n
, max
k
(A)
=
a
i
k
.If
k>n
then, by convention, max
k
(a
1
,...,a
n
)
=
0. The
critical region
for a set of
n
intervals
[
L
i
,U
i
]
is the interval
(L
c
,U
c
)
where:
L
c
U
c
=
max
k
(L
1
,...,L
n
)
=
max
k
+
1
(U
1
,...,U
n
)
Figure 6.1
illustrates the critical region for the case when
n
=
5 and
k
=
2. The critical region
plays a key role in guiding the algorithm that simulates the tuples
t
1
,...,t
n
in an optimal order.