where θ ∈ R^d are parameters and ⟨·, ·⟩ is the classical dot product operator. The set of candidate policies Π_Θ is composed of all index-based policies obtained with such index functions given parameters θ ∈ R^d.
History features may describe any aspect of the history, including empirical reward moments, the current time step, arm play counts, or combinations of these variables. The set of such features should not be too large, to avoid parameter estimation difficulties, but it should be large enough to provide the support for a rich set of E/E strategies. We here propose one possible definition of the history feature function that can be applied to any multi-armed bandit problem and that is shown to perform well in Section 6.
To compute φ(H_t, k), we first compute the following four variables: v_1 = √(ln t), v_2 = 1/√(t_k), v_3 = r̄_k and v_4 = σ_k, i.e. the square root of the logarithm of the current time step, the inverse square root of the number of times arm k has been played, and the empirical mean and standard deviation of the rewards obtained so far by arm k.
Then, these variables are multiplied in different ways to produce features. The number of these combinations is controlled by a parameter P whose default value is 1. Given P, there is one feature f_{i,j,k,l} per possible combination of values i, j, k, l ∈ {0, ..., P}, defined as f_{i,j,k,l} = v_1^i v_2^j v_3^k v_4^l. In other terms, there is one feature per possible polynomial term of degree at most P in each of the variables v_1, ..., v_4. In the following, we denote POWER-1 (resp., POWER-2) the policy learned using function φ(H_t, k) with parameter P = 1 (resp., P = 2). The index function that underlies these policies can be written as follows:
\[
\text{index}_{\text{power-}P}(H_t, k) \;=\; \sum_{i=0}^{P}\sum_{j=0}^{P}\sum_{k=0}^{P}\sum_{l=0}^{P} \theta_{i,j,k,l}\, v_1^{i}\, v_2^{j}\, v_3^{k}\, v_4^{l} \tag{8}
\]
where θ_{i,j,k,l} are the learned parameters. The POWER-1 policy has 16 such parameters and the POWER-2 policy has 81.
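For concreteness, the following is a minimal sketch (not the authors' implementation) of how such an index could be evaluated for one arm. The names rewards_k, t, theta and P are illustrative, and the sketch assumes the arm has already been played at least once so that t_k > 0.

```python
# Illustrative sketch of the power-P index of Eq. (8); names are hypothetical.
import itertools
import math
import statistics


def power_index(rewards_k, t, theta, P=1):
    """Return index_{power-P}(H_t, k) for one arm.

    rewards_k : rewards observed so far for this arm (non-empty list)
    t         : current time step
    theta     : dict mapping exponent tuples (i, j, k, l) to learned parameters
    """
    t_k = len(rewards_k)                      # number of times the arm was played
    v1 = math.sqrt(math.log(t))               # sqrt of log of current time step
    v2 = 1.0 / math.sqrt(t_k)                 # inverse sqrt of the play count
    v3 = statistics.mean(rewards_k)           # empirical mean reward
    v4 = statistics.pstdev(rewards_k)         # empirical reward standard deviation

    # Sum over all monomials v1^i v2^j v3^k v4^l with exponents in {0, ..., P}.
    index = 0.0
    for i, j, k, l in itertools.product(range(P + 1), repeat=4):
        index += theta[(i, j, k, l)] * (v1 ** i) * (v2 ** j) * (v3 ** k) * (v4 ** l)
    return index
```

At each time step, the policy plays the arm with the largest index value; with P = 1 the dictionary theta holds the 16 parameters mentioned above, and with P = 2 it holds 81.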
4.2 Optimisation Algorithm
We now discuss the optimization of Equation 7 in the case of our numerical parameterization. Note that the objective function we want to optimize, in addition to being stochastic, has a complex relation with the parameters θ. A slight change in the parameter vector θ may lead to significantly different bandit episodes and expected regret values. Local optimization approaches may thus not be appropriate here. Instead, we suggest the use of derivative-free global optimization algorithms.
In this work, we use a powerful yet simple class of global optimization algorithms known as cross-entropy methods, also referred to as Estimation of Distribution Algorithms (EDAs) [9]. EDAs rely on a probabilistic model to describe promising regions of the search space and to sample good candidate solutions. This is performed by repeating iterations that first sample a population of n_p candidates using the current probabilistic model and then fit a new probabilistic model given the b < n_p best candidates.
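As an illustration, below is a minimal sketch of this sample-and-refit loop, assuming a hypothetical evaluate function that estimates the expected regret of a candidate parameter vector by simulating bandit episodes. For concreteness, the sketch models each parameter with an independent Gaussian, anticipating the univariate marginal approach discussed next.

```python
# Illustrative EDA / cross-entropy loop; evaluate() is assumed to return an
# estimate of the expected regret of a parameter vector (lower is better).
import numpy as np


def eda_optimize(evaluate, dim, n_p=100, b=10, n_iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)        # initial probabilistic model

    for _ in range(n_iterations):
        # 1) Sample a population of n_p candidate parameter vectors.
        population = rng.normal(mean, std, size=(n_p, dim))
        # 2) Evaluate each candidate by simulating bandit episodes.
        scores = np.array([evaluate(theta) for theta in population])
        # 3) Keep the b < n_p best candidates (smallest expected regret).
        elite = population[np.argsort(scores)[:b]]
        # 4) Refit one independent Gaussian per parameter on the elite sample.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-12

    return mean
```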
Any kind of probabilistic model may be used inside an EDA. The simplest form of
EDAs uses one marginal distribution per variable to optimize and is known as the uni-
variate marginal distribution algorithm [10]. We have adopted this approach by using
 