Information Technology Reference
In-Depth Information
3
Learning Exploration/Exploitation Strategies
Instead of relying on a fixed E/E strategy to solve a given class of problems, we pro-
pose a systematic approach to exploit prior knowledge by learning E/E strategies in a
problem-driven way. We now state our learning approach in abstract terms.
Prior knowledge is represented as a distribution
D P over bandit problems P =
( ν 1 ,...,ν K ) . From this distribution, we can sample as many training problems as de-
sired. In order to learn E/E strategies exploiting this knowledge, we rely on a parametric
family of candidate strategies Π Θ ⊂{
} H whose members are policies π θ
1 , 2 ,...,K
that are fully defined given parameters θ
Θ .Given Π Θ , the learning problem aims at
solving:
θ =argmin
θ
R P,T }}
E P∼D P {
E
{
,
(6)
Θ
R P,T }
where E
is the expected cumulative regret of π on problem P and where T is the
(a-priori given) time playing horizon. Solving this minimization problem is non trivial
since it involves an expectation over an infinite number of problems. Furthermore, given
a problem P , computing E
{
R P,T }
relies on the expected values of T k ( T ) ,whichwe
cannot compute exactly in the general case. Therefore, we propose to approximate the
expected cumulative regret by the empirical mean regret over a finite set of training
problems P (1) ,...,P ( N )
{
D P :
from
N
Δ ( π θ ) where Δ ( π )= 1
N
θ =argmin
θ∈Θ
R P ( i ) ,T ,
(7)
i =1
and where R π P ( i ) ,T values are estimated performing a single trajectory of π θ on problem
P . Note that the number of training problems N will typically be large in order to make
the variance Δ (
) reasonably small.
In order to instantiate this approach, two components have to be provided: the hy-
pothesis space Π Θ and the optimization algorithm to solve Eq. 7. The next two sections
describe different instantiations of these components.
·
4
Numeric Parameterization
We now instantiate our meta-learning approach by considering E/E strategies that have
numerical parameters.
4.1
Policy Search Space
To define the parametric family of candidate policies Π Θ , we use index functions ex-
pressed as linear combinations of history features. These index functions rely on an
history feature function φ :
d , that describes the history w.r.t. a
given arm as a vector of scalar features. Given the function φ (
H×{
1 , 2 ,...,K
}→
·
,
·
) , index functions are
defined by
index θ ( H t ,k )=
θ,φ ( H t ,k )
,
 
Search WWH ::




Custom Search