where θ ∈ R^d are parameters and ⟨·, ·⟩ is the classical dot product operator. The set of candidate policies Π_Θ is composed of all index-based policies obtained with such index functions given parameters θ ∈ R^d.
History features may describe any aspect of the history, including empirical reward moments, the current time step, arm play counts, or combinations of these variables. The set of such features should not be too large, to avoid parameter estimation difficulties, but it should be large enough to provide the support for a rich set of E/E strategies. We here propose one possible definition of the history feature function that can be applied to any multi-armed bandit problem and that is shown to perform well in Section 6.
To compute φ(H_t, k), we first compute the following four variables: v_1 = √(ln t), v_2 = 1/√(t_k), v_3 = r̄_k and v_4 = σ_k, i.e. the square root of the logarithm of the current time step, the inverse square root of the number of times arm k has been played, and the empirical mean and standard deviation of the rewards obtained so far by arm k.
Then, these variables are multiplied in different ways to produce features. The number of these combinations is controlled by a parameter P whose default value is 1. Given P, there is one feature f_{i,j,k,l} per possible combination of values i, j, k, l ∈ {0, ..., P}, defined as f_{i,j,k,l} = v_1^i v_2^j v_3^k v_4^l. In other terms, there is one feature per possible polynomial term of degree at most P in each of the variables v_1, ..., v_4. In the following, we denote POWER-1 (resp., POWER-2) the policy learned using function φ(H_t, k) with parameter P = 1 (resp., P = 2). The index function that underlies these policies can be written as follows:
\[
\text{index}_{\text{power-}P}(H_t, k) \;=\; \sum_{i=0}^{P}\sum_{j=0}^{P}\sum_{k=0}^{P}\sum_{l=0}^{P} \theta_{i,j,k,l}\, v_1^{i}\, v_2^{j}\, v_3^{k}\, v_4^{l} \tag{8}
\]
where θ_{i,j,k,l} are the learned parameters. The POWER-1 policy has 16 such parameters and the POWER-2 policy has 81.
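For concreteness, the following is a minimal sketch (not the authors' implementation) of how such an index could be evaluated for one arm. The names rewards_k, t, theta and P are illustrative, and the sketch assumes the arm has already been played at least once so that t_k > 0.

```python
# Illustrative sketch of the power-P index of Eq. (8); names are hypothetical.
import itertools
import math
import statistics


def power_index(rewards_k, t, theta, P=1):
    """Return index_{power-P}(H_t, k) for one arm.

    rewards_k : rewards observed so far for this arm (non-empty list)
    t         : current time step
    theta     : dict mapping exponent tuples (i, j, k, l) to learned parameters
    """
    t_k = len(rewards_k)                      # number of times the arm was played
    v1 = math.sqrt(math.log(t))               # sqrt of log of current time step
    v2 = 1.0 / math.sqrt(t_k)                 # inverse sqrt of the play count
    v3 = statistics.mean(rewards_k)           # empirical mean reward
    v4 = statistics.pstdev(rewards_k)         # empirical reward standard deviation

    # Sum over all monomials v1^i v2^j v3^k v4^l with exponents in {0, ..., P}.
    index = 0.0
    for i, j, k, l in itertools.product(range(P + 1), repeat=4):
        index += theta[(i, j, k, l)] * (v1 ** i) * (v2 ** j) * (v3 ** k) * (v4 ** l)
    return index
```

At each time step, the policy plays the arm with the largest index value; with P = 1 the dictionary theta holds the 16 parameters mentioned above, and with P = 2 it holds 81.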
4.2 Optimisation Algorithm
We now discuss the optimization of Equation 7 in the case of our numerical parameterization. Note that the objective function we want to optimize, in addition to being stochastic, has a complex relation with the parameters θ. A slight change in the parameter vector θ may lead to significantly different bandit episodes and expected regret values. Local optimization approaches may thus not be appropriate here. Instead, we suggest the use of derivative-free global optimization algorithms.
In this work, we use a powerful yet simple class of global optimization algorithms known as cross-entropy methods, also referred to as Estimation of Distribution Algorithms (EDAs) [9]. EDAs rely on a probabilistic model to describe promising regions of the search space and to sample good candidate solutions. This is performed by repeating iterations that first sample a population of n_p candidates using the current probabilistic model and then fit a new probabilistic model given the b < n_p best candidates.
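As an illustration, below is a minimal sketch of this sample-and-refit loop, assuming a hypothetical evaluate function that estimates the expected regret of a candidate parameter vector by simulating bandit episodes. For concreteness, the sketch models each parameter with an independent Gaussian, anticipating the univariate marginal approach discussed next.

```python
# Illustrative EDA / cross-entropy loop; evaluate() is assumed to return an
# estimate of the expected regret of a parameter vector (lower is better).
import numpy as np


def eda_optimize(evaluate, dim, n_p=100, b=10, n_iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)        # initial probabilistic model

    for _ in range(n_iterations):
        # 1) Sample a population of n_p candidate parameter vectors.
        population = rng.normal(mean, std, size=(n_p, dim))
        # 2) Evaluate each candidate by simulating bandit episodes.
        scores = np.array([evaluate(theta) for theta in population])
        # 3) Keep the b < n_p best candidates (smallest expected regret).
        elite = population[np.argsort(scores)[:b]]
        # 4) Refit one independent Gaussian per parameter on the elite sample.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-12

    return mean
```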
Any kind of probabilistic model may be used inside an EDA. The simplest form of
EDAs uses one marginal distribution per variable to optimize and is known as the uni-
variate marginal distribution algorithm [10]. We have adopted this approach by using
 