erroneous prior assumptions, by testing the E/E strategies learnt for Bernoulli arms on
bandits with rewards following a truncated Gaussian distribution.
The ideas presented in this paper have their roots in two previously published papers. The idea of learning multi-armed bandit policies using global optimization and numerically parameterized index-based policies was first proposed in [7]. Searching for good multi-armed bandit policies in a formula space was first proposed in [8]. Compared to this previous work, we adopt here a unifying perspective, namely the learning of E/E strategies from prior knowledge. We also introduce an improved optimization procedure for formula search, based on the identification of equivalence classes and on a pure-exploration multi-armed bandit problem formalization.
This paper is structured as follows. We first formally define the multi-armed bandit problem and introduce index-based policies in Section 2. Section 3 formally states the E/E strategy learning problem. Section 4 and Section 5 present the numerical and symbolic instantiations of our learning approach, respectively. Section 6 reports on experimental
results. Finally, we conclude and present future research directions in Section 7.
2 Multi-armed Bandit Problem and Policies
We now formally describe the (discrete) multi-armed bandit problem and the class of
index-based policies.
2.1 The Multi-armed Bandit Problem
We denote by $i \in \{1, 2, \ldots, K\}$ the ($K \geq 2$) arms of the bandit problem, by $\nu_i$ the reward distribution of arm $i$, and by $\mu_i$ its expected value; $b_t$ is the arm played at round $t$, and $r_t \sim \nu_{b_t}$ is the obtained reward. $H_t = [b_1, r_1, b_2, r_2, \ldots, b_t, r_t]$ is a vector that gathers the history over the first $t$ plays, and we denote by $\mathcal{H}$ the set of all possible histories of any length $t$. An E/E strategy (or policy) $\pi : \mathcal{H} \to \{1, 2, \ldots, K\}$ is an algorithm that processes at play $t$ the vector $H_{t-1}$ to select the arm $b_t \in \{1, 2, \ldots, K\}$: $b_t = \pi(H_{t-1})$.
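To make these objects concrete, the following minimal Python sketch (not from the paper; the class and function names are illustrative assumptions) encodes a $K$-armed Bernoulli bandit and a trivial E/E strategy implementing the mapping $\pi : \mathcal{H} \to \{1, 2, \ldots, K\}$, with arms 0-indexed for convenience.

```python
import random

class BernoulliBandit:
    """K-armed bandit; arm i yields reward 1 with probability mu[i], 0 otherwise."""
    def __init__(self, mu):
        self.mu = list(mu)       # true expected rewards mu_1, ..., mu_K
        self.K = len(self.mu)

    def play(self, arm):
        """Draw the stochastic reward r_t ~ nu_{b_t} for the chosen arm (0-indexed)."""
        return 1.0 if random.random() < self.mu[arm] else 0.0

def uniform_random_policy(history, K):
    """A trivial E/E strategy pi: H -> {1,...,K}; it ignores the history H_{t-1}.
    `history` is the list [(b_1, r_1), ..., (b_{t-1}, r_{t-1})]."""
    return random.randrange(K)
```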
The regret of the policy $\pi$ after $T$ plays is defined by $R_T = T\mu^* - \sum_{t=1}^{T} r_t$, where $\mu^* = \max_k \mu_k$ refers to the expected reward of the optimal arm. The expected value of the regret represents the expected loss due to the fact that the policy does not always play the best machine. It can be written as:
$$E\{R_T\} = \sum_{k=1}^{K} E\{T_k(T)\}\,(\mu^* - \mu_k), \qquad (1)$$
where $T_k(T)$ denotes the number of times the policy has drawn arm $k$ on the first $T$ rounds.
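As an illustration (assuming the 0-indexed history representation of the sketch above), the hypothetical helpers below compute the realized regret $T\mu^* - \sum_t r_t$ of a finished run and the per-arm decomposition of Eq. (1) evaluated with the empirical draw counts $T_k(T)$; the two quantities coincide only in expectation.

```python
def realized_regret(history, mu):
    """Realized regret T*mu_star - sum_t r_t for history [(b_1, r_1), ..., (b_T, r_T)]."""
    T = len(history)
    mu_star = max(mu)
    return T * mu_star - sum(r for _, r in history)

def regret_decomposition(history, mu):
    """Plug the empirical draw counts T_k(T) into the decomposition of Eq. (1).
    Eq. (1) is a statement about expectations; this is its empirical counterpart."""
    mu_star = max(mu)
    counts = [0] * len(mu)
    for arm, _ in history:
        counts[arm] += 1
    return sum(counts[k] * (mu_star - mu[k]) for k in range(len(mu)))
```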
The multi-armed bandit problem aims at finding a policy $\pi$ that for a given $K$ minimizes the expected regret (or, in other words, maximizes the expected reward), ideally for any $T$ and any $\{\nu_i\}_{i=1}^{K}$.
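To connect this objective to simulation, one possible way to estimate the expected regret $E\{R_T\}$ of a candidate E/E strategy for a fixed $\{\nu_i\}_{i=1}^{K}$ is plain Monte Carlo, sketched below by reusing the hypothetical BernoulliBandit, uniform_random_policy and realized_regret helpers from the earlier sketches; this is only an evaluation loop, not the paper's learning procedure.

```python
def run_policy(policy, bandit, T):
    """Play T rounds with pi: (H_{t-1}, K) -> b_t and return the history H_T."""
    history = []
    for _ in range(T):
        arm = policy(history, bandit.K)
        history.append((arm, bandit.play(arm)))
    return history

def estimate_expected_regret(policy, mu, T, n_runs=1000):
    """Monte Carlo estimate of E{R_T} for Bernoulli arms with means mu."""
    total = 0.0
    for _ in range(n_runs):
        history = run_policy(policy, BernoulliBandit(mu), T)
        total += realized_regret(history, mu)
    return total / n_runs

# Example: expected regret of uniform random play over T = 100 rounds on 3 arms.
# print(estimate_expected_regret(uniform_random_policy, [0.2, 0.5, 0.8], T=100))
```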