each arm uniformly in the range [0, 1]. Rewards are then sampled using a rejection sampling approach: samples are drawn from the corresponding Gaussian distribution until a value inside the interval [0, 1] is obtained.
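As an illustration, the following minimal Python sketch reproduces this reward-generation scheme; the function name and the use of NumPy are our own choices and not part of the original setup.

    import numpy as np

    def sample_truncated_reward(mu, sigma, rng):
        # Rejection sampling: redraw from the Gaussian until the sample falls in [0, 1].
        while True:
            r = rng.normal(mu, sigma)
            if 0.0 <= r <= 1.0:
                return r

    rng = np.random.default_rng(0)
    # Example arm: mean and standard deviation both drawn uniformly in [0, 1].
    mu, sigma = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
    reward = sample_truncated_reward(mu, sigma, rng)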
Generic Policies. We consider the following generic policies: the εn-GREEDY policy as described in [4], the policies UCB1, UCB1-TUNED, UCB1-NORMAL and UCB2 also introduced in [4], the policy KL-UCB introduced in [13] and the policy UCB-V proposed by [5]. Except εn-GREEDY, all these policies belong to the family of index-based policies discussed previously. UCB1-TUNED and UCB1-NORMAL are parameter-free policies designed for bandit problems with Bernoulli distributions and with Gaussian distributions, respectively. All the other policies have hyper-parameters that can be tuned to improve the quality of the policy: εn-GREEDY has two parameters c > 0 and 0 < d < 1, UCB2 has one parameter 0 < α < 1, KL-UCB has one parameter c ≥ 0 and UCB-V has two parameters ζ > 0 and c > 0. We refer the reader to [4, 5, 13] for detailed explanations of these parameters.
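For concreteness, the sketch below shows the index computation of UCB1 from [4], which is representative of the index-based policies listed above; the variable names are ours, and the other index-based policies differ only in the exploration term they add to the empirical mean.

    import math

    def ucb1_index(empirical_mean, n_k, t):
        # UCB1 index of an arm: empirical mean reward plus an exploration bonus
        # that shrinks as the arm is played more often (n_k plays out of t total plays).
        return empirical_mean + math.sqrt(2.0 * math.log(t) / n_k)

    def select_arm(means, counts, t):
        # Play each arm once, then always pick the arm with the largest index.
        for k, n_k in enumerate(counts):
            if n_k == 0:
                return k
        return max(range(len(means)), key=lambda k: ucb1_index(means[k], counts[k], t))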
Learning Numerical Policies. We learn policies using the two parameterizations POWER-1 and POWER-2 described in Section 4.1. Note that tuning generic policies is a particular case of learning with numerical parameters, and that both learned policies and tuned generic policies make use of the same prior knowledge. To make the comparison between these two kinds of policies fair, we always use the same training procedure, namely Algorithm 2 with i_max = 100 iterations, n_p = max(8d, 40) candidate policies per iteration and b = n_p / 4 best elements, where d is the number of parameters to optimize. Having a linear dependency between n_p and d is a classical choice when using EDAs [14]. Note that, in most cases, the optimization converges within a few to a few tens of iterations. Our simulations have shown that i_max = 100 is a conservative choice that gives the optimization enough time to converge properly. For the baseline policies for which default values are advocated, we use these values as the initial means of the EDA Gaussians. Otherwise, the initial Gaussians are centered on zero. Nothing is done to force the EDA to respect the constraints on the parameters (e.g., c > 0 and 0 < d < 1 for εn-GREEDY). In practice, the EDA automatically identifies interesting regions of the search space that respect these constraints.
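As an illustration of this training procedure, here is a hedged sketch of a plain Gaussian EDA loop with the settings given above; the objective evaluate_policy (returning an estimate of the expected regret of a parameter vector, lower being better) is a placeholder, and Algorithm 2 in the paper should be taken as the authoritative description.

    import numpy as np

    def gaussian_eda(evaluate_policy, d, init_mean=None, i_max=100, seed=0):
        rng = np.random.default_rng(seed)
        n_p = max(8 * d, 40)                 # candidate policies per iteration
        b = n_p // 4                         # number of best candidates kept
        mean = np.zeros(d) if init_mean is None else np.asarray(init_mean, dtype=float)
        std = np.ones(d)
        best_theta, best_score = mean.copy(), evaluate_policy(mean)
        for _ in range(i_max):
            # Sample candidates from the current Gaussian; no parameter constraints are enforced.
            candidates = rng.normal(mean, std, size=(n_p, d))
            scores = np.array([evaluate_policy(theta) for theta in candidates])
            elite = candidates[np.argsort(scores)[:b]]
            mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-9
            if scores.min() < best_score:
                best_score, best_theta = scores.min(), candidates[np.argmin(scores)].copy()
        return best_theta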
Learning Symbolic Policies. We apply our symbolic learning approach with a maximal formula length of L = 7, which leads to a set of about 33.5 million formulas. We have applied the approximate partitioning approach described in Section 5.2 to these formulas, using d = 1024 samples to discriminate among strategies. This has resulted in |Θ| ≈ 9.5 million invalid formulas and M = 99020 distinct candidate E/E strategies (i.e., distinct formula equivalence classes). To identify the best of those distinct strategies, we apply the UCB1-TUNED algorithm for 10^7 steps. In our experiments, we report the two best found policies, which we denote FORMULA-1 and FORMULA-2.
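The partitioning step can be sketched as follows: each formula is evaluated on the d sampled inputs, formulas that produce invalid values are discarded, and the remaining formulas are grouped by the vector of values they produce. The formula representation and the eval_formula helper below are hypothetical; only one representative per equivalence class then needs to be evaluated by the bandit algorithm (here, UCB1-TUNED).

    import math
    from collections import defaultdict

    def partition_formulas(formulas, eval_formula, sample_points):
        # eval_formula(f, x) is assumed to return the value of formula f on input x,
        # raising an exception or returning NaN/inf when the formula is ill-defined.
        classes = defaultdict(list)          # value signature -> equivalent formulas
        invalid = []
        for f in formulas:
            try:
                values = [eval_formula(f, x) for x in sample_points]
            except (ArithmeticError, ValueError):
                invalid.append(f)
                continue
            if any(math.isnan(v) or math.isinf(v) for v in values):
                invalid.append(f)
            else:
                # Formulas sharing the same rounded value vector count as one E/E strategy.
                classes[tuple(round(v, 12) for v in values)].append(f)
        return classes, invalid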
6.2 Performance Comparison
Table 1 reports the results we obtain for untuned generic policies, tuned generic policies and learned policies on the two distributions of bandit problems described above, with horizons T ∈ {10, 100, 1000}. For both tuned and learned policies, we consider three different training horizons {10, 100, 1000} to see the effect of a mismatch between the training and the testing horizon.
 