In our experiments, we estimate that our multi-armed bandit approach is one hundred to one thousand times faster than the naive Monte Carlo optimization procedure, which clearly demonstrates the benefits of this approach. Note that this idea could also be relevant to our numerical case. The main difference is that the corresponding multi-armed bandit problem relies on a continuous arm space. Although some algorithms have already been proposed to solve such multi-armed bandit problems [12], how to scale these techniques to problems with hundreds or thousands of parameters is still an open research question. Progress in this field could directly benefit our numerical learning approach.
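As a rough sketch of how such a bandit-based allocation of the optimization budget can be implemented (a generic UCB1 scheme over a finite candidate set, not necessarily the exact procedure used in our experiments), the following Python snippet spends a fixed evaluation budget mostly on the most promising candidate policies; the evaluate_once interface is an assumption.

import math

def ucb1_candidate_search(candidates, evaluate_once, budget):
    """Allocate `budget` noisy evaluations among candidates with UCB1.

    evaluate_once(candidate) is assumed to return one noisy performance
    sample, e.g. the negated regret of a single simulated bandit episode;
    this interface is a hypothetical placeholder.
    """
    k = len(candidates)
    counts = [1] * k
    means = [evaluate_once(c) for c in candidates]  # one sample each to start
    for t in range(k + 1, budget + 1):
        # UCB1 index: empirical mean plus an exploration bonus.
        i = max(range(k),
                key=lambda j: means[j] + math.sqrt(2.0 * math.log(t) / counts[j]))
        x = evaluate_once(candidates[i])
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]      # incremental mean update
    return candidates[max(range(k), key=lambda j: means[j])]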
6 Numerical Experiments
We now illustrate the two instances of our learning approach by comparing learned policies against a number of previously proposed generic policies, in a setting where prior knowledge is available about the target problems. We show that in both cases, learning makes it possible to obtain exploration/exploitation strategies that significantly outperform all tested generic policies.
6.1 Experimental Protocol
We compare learned policies against generic policies. We distinguish between untuned generic policies and tuned generic policies. The former are either policies that are parameter-free or policies used with default parameters suggested in the literature, while the latter are generic policies whose hyper-parameters were tuned using Algorithm 2.
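Algorithm 2 is not reproduced in this section; as a minimal stand-in, the Python sketch below selects the hyper-parameter value of a generic policy (e.g., the exploration constant of UCB1) that minimizes its estimated regret on the training problems. The candidate grid and the estimate_regret function are hypothetical placeholders.

def tune_generic_policy(candidate_values, estimate_regret):
    """Return the hyper-parameter value with the lowest estimated regret.

    estimate_regret(value) is assumed to run the generic policy with the
    given hyper-parameter on the training bandit problems and return its
    average regret; this grid search is only a stand-in for Algorithm 2.
    """
    return min(candidate_values, key=estimate_regret)

# Hypothetical usage for UCB1's exploration constant C:
# best_C = tune_generic_policy([0.05, 0.1, 0.2, 0.5, 1.0, 2.0],
#                              lambda C: average_regret(ucb1(C), training_problems))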
Training and Testing. To illustrate our approach, we consider the scenario where the number of arms K, the playing horizon T, and the kinds of distributions ν_k are known a priori, while the parameters of these distributions are unknown. Since we are learning policies, care should be taken with generalization issues. As usual in supervised machine learning, we use a training set that is distinct from the testing set. The training set is composed of N = 100 bandit problems sampled from a given distribution over bandit problems D_P, whereas the testing set contains another 10000 problems drawn from this distribution. To study the robustness of our policies w.r.t. wrong prior information, we also report their performance on a set of 10000 problems drawn from a second distribution over bandit problems with different kinds of reward distributions ν_k. When computing Δ(π_θ), we estimate the regret for each of these problems by averaging results over 100 runs. One calculation of Δ(π_θ) thus involves simulating 10^4 (resp. 10^6) bandit episodes during training (resp. testing).
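As a minimal sketch of this evaluation protocol (not our exact implementation), the following Python function estimates Δ(π_θ) by averaging the regret of a policy over a set of bandit problems, with 100 simulated runs per problem; the play_episode interface is an assumption.

def estimate_delta(policy, problems, horizon, play_episode, n_runs=100):
    """Monte Carlo estimate of the expected regret Delta(pi_theta) of a policy.

    play_episode(policy, problem, horizon) is assumed to simulate one bandit
    episode and return its regret; this interface is a hypothetical placeholder.
    """
    total = 0.0
    for problem in problems:
        for _ in range(n_runs):
            total += play_episode(policy, problem, horizon)
    return total / (len(problems) * n_runs)

# With the N = 100 training problems this simulates 100 * 100 = 10^4 episodes
# per evaluation; with the 10000 testing problems it simulates 10^6 episodes.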
Problem Distributions. The distribution D_P is composed of two-armed bandit problems with Bernoulli distributions whose expectations are uniformly drawn from [0, 1]. Hence, in order to sample a bandit problem from D_P, we draw the expectations p_1 and p_2 uniformly from [0, 1] and return the bandit problem with two Bernoulli arms that have expectations p_1 and p_2, respectively. In the second distribution, the reward distributions ν_k are replaced by Gaussian distributions truncated to the interval [0, 1]. In order to sample one problem from this second distribution, we select a mean and a standard deviation for
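As a hedged illustration of these two problem generators, the Python sketch below samples two-armed problems from the Bernoulli distribution D_P and from a truncated-Gaussian variant; each arm is represented as a function returning one reward draw. The text above is cut off before specifying the ranges for the Gaussian means and standard deviations, so those ranges are assumptions.

import random

def sample_bernoulli_problem():
    """Sample a problem from D_P: two Bernoulli arms with expectations uniform in [0, 1]."""
    p1, p2 = random.uniform(0.0, 1.0), random.uniform(0.0, 1.0)
    return [lambda p=p1: float(random.random() < p),
            lambda p=p2: float(random.random() < p)]

def sample_truncated_gaussian_problem(mean_range=(0.0, 1.0), std_range=(0.0, 1.0)):
    """Sample a problem from the second distribution: Gaussian rewards truncated to [0, 1].

    mean_range and std_range are assumed values, not taken from the text.
    """
    def make_arm(mu, sigma):
        def draw():
            # Rejection sampling implements the truncation to [0, 1].
            x = random.gauss(mu, sigma)
            while not 0.0 <= x <= 1.0:
                x = random.gauss(mu, sigma)
            return x
        return draw
    return [make_arm(random.uniform(*mean_range), random.uniform(*std_range))
            for _ in range(2)]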
 