In our experiments, we estimate that our multi-armed bandit approach is one hundred to one thousand times faster than the naive Monte Carlo optimization procedure, which clearly demonstrates the benefits of this approach. Note that this idea could also be relevant to our numerical case. The main difference is that the corresponding multi-armed bandit problem relies on a continuous arm space. Although some algorithms have already been proposed to solve such multi-armed bandit problems [12], how to scale these techniques to problems with hundreds or thousands of parameters is still an open research question. Progress in this field could directly benefit our numerical learning approach.
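As a rough sketch of how such a bandit-based allocation of the optimization budget can be implemented (a generic UCB1 scheme over a finite candidate set, not necessarily the exact procedure used in our experiments), the following Python snippet spends a fixed evaluation budget mostly on the most promising candidate policies; the evaluate_once interface is an assumption.

import math

def ucb1_candidate_search(candidates, evaluate_once, budget):
    """Allocate `budget` noisy evaluations among candidates with UCB1.

    evaluate_once(candidate) is assumed to return one noisy performance
    sample, e.g. the negated regret of a single simulated bandit episode;
    this interface is a hypothetical placeholder.
    """
    k = len(candidates)
    counts = [1] * k
    means = [evaluate_once(c) for c in candidates]  # one sample each to start
    for t in range(k + 1, budget + 1):
        # UCB1 index: empirical mean plus an exploration bonus.
        i = max(range(k),
                key=lambda j: means[j] + math.sqrt(2.0 * math.log(t) / counts[j]))
        x = evaluate_once(candidates[i])
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]      # incremental mean update
    return candidates[max(range(k), key=lambda j: means[j])]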
6 Numerical Experiments
We now illustrate the two instances of our learning approach by comparing learned policies against a number of previously proposed generic policies, in a setting where prior knowledge is available about the target problems. We show that in both cases, learning makes it possible to obtain exploration/exploitation strategies that significantly outperform all tested generic policies.
6.1 Experimental Protocol
We compare learned policies against generic policies. We distinguish between untuned generic policies and tuned generic policies. The former are either policies that are parameter-free or policies used with default parameters suggested in the literature, while the latter are generic policies whose hyper-parameters were tuned using Algorithm 2.
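Algorithm 2 is not reproduced in this section; as a minimal stand-in, the Python sketch below selects the hyper-parameter value of a generic policy (e.g., the exploration constant of UCB1) that minimizes its estimated regret on the training problems. The candidate grid and the estimate_regret function are hypothetical placeholders.

def tune_generic_policy(candidate_values, estimate_regret):
    """Return the hyper-parameter value with the lowest estimated regret.

    estimate_regret(value) is assumed to run the generic policy with the
    given hyper-parameter on the training bandit problems and return its
    average regret; this grid search is only a stand-in for Algorithm 2.
    """
    return min(candidate_values, key=estimate_regret)

# Hypothetical usage for UCB1's exploration constant C:
# best_C = tune_generic_policy([0.05, 0.1, 0.2, 0.5, 1.0, 2.0],
#                              lambda C: average_regret(ucb1(C), training_problems))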
Training and Testing. To illustrate our approach, we consider the scenario where the number of arms K, the playing horizon T, and the kinds of distributions ν_k are known a priori, while the parameters of these distributions are unknown. Since we are learning policies, care should be taken with generalization issues. As usual in supervised machine learning, we use a training set that is distinct from the testing set. The training set is composed of N = 100 bandit problems sampled from a given distribution over bandit problems D_P, whereas the testing set contains another 10000 problems drawn from this distribution. To study the robustness of our policies w.r.t. wrong prior information, we also report their performance on a set of 10000 problems drawn from a second distribution over bandit problems with different kinds of reward distributions ν_k. When computing Δ(π_θ), we estimate the regret for each of these problems by averaging results over 100 runs. One calculation of Δ(π_θ) thus involves simulating 10^4 (resp. 10^6) bandit episodes during training (resp. testing).
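As a minimal sketch of this evaluation protocol (not our exact implementation), the following Python function estimates Δ(π_θ) by averaging the regret of a policy over a set of bandit problems, with 100 simulated runs per problem; the play_episode interface is an assumption.

def estimate_delta(policy, problems, horizon, play_episode, n_runs=100):
    """Monte Carlo estimate of the expected regret Delta(pi_theta) of a policy.

    play_episode(policy, problem, horizon) is assumed to simulate one bandit
    episode and return its regret; this interface is a hypothetical placeholder.
    """
    total = 0.0
    for problem in problems:
        for _ in range(n_runs):
            total += play_episode(policy, problem, horizon)
    return total / (len(problems) * n_runs)

# With the N = 100 training problems this simulates 100 * 100 = 10^4 episodes
# per evaluation; with the 10000 testing problems it simulates 10^6 episodes.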
Problem Distributions. The distribution D_P is composed of two-armed bandit problems with Bernoulli distributions whose expectations are uniformly drawn from [0, 1]. Hence, in order to sample a bandit problem from D_P, we draw the expectations p_1 and p_2 uniformly from [0, 1] and return the bandit problem with two Bernoulli arms that have expectations p_1 and p_2, respectively. In the second distribution, the reward distributions ν_k are replaced by Gaussian distributions truncated to the interval [0, 1]. In order to sample one problem from this second distribution, we select a mean and a standard deviation for
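As a hedged illustration of these two problem generators, the Python sketch below samples two-armed problems from the Bernoulli distribution D_P and from a truncated-Gaussian variant; each arm is represented as a function returning one reward draw. The text above is cut off before specifying the ranges for the Gaussian means and standard deviations, so those ranges are assumptions.

import random

def sample_bernoulli_problem():
    """Sample a problem from D_P: two Bernoulli arms with expectations uniform in [0, 1]."""
    p1, p2 = random.uniform(0.0, 1.0), random.uniform(0.0, 1.0)
    return [lambda p=p1: float(random.random() < p),
            lambda p=p2: float(random.random() < p)]

def sample_truncated_gaussian_problem(mean_range=(0.0, 1.0), std_range=(0.0, 1.0)):
    """Sample a problem from the second distribution: Gaussian rewards truncated to [0, 1].

    mean_range and std_range are assumed values, not taken from the text.
    """
    def make_arm(mu, sigma):
        def draw():
            # Rejection sampling implements the truncation to [0, 1].
            x = random.gauss(mu, sigma)
            while not 0.0 <= x <= 1.0:
                x = random.gauss(mu, sigma)
            return x
        return draw
    return [make_arm(random.uniform(*mean_range), random.uniform(*std_range))
            for _ in range(2)]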
 