Information Technology Reference
In-Depth Information
3
Learning Exploration/Exploitation Strategies
Instead of relying on a fixed E/E strategy to solve a given class of problems, we pro-
pose a systematic approach to exploit prior knowledge by learning E/E strategies in a
problem-driven way. We now state our learning approach in abstract terms.
Prior knowledge is represented as a distribution
D
P
over bandit problems
P
=
(
ν
1
,...,ν
K
)
. From this distribution, we can sample as many training problems as de-
sired. In order to learn E/E strategies exploiting this knowledge, we rely on a parametric
family of candidate strategies
Π
Θ
⊂{
}
H
whose members are policies
π
θ
1
,
2
,...,K
that are fully defined given parameters
θ
∈
Θ
.Given
Π
Θ
, the learning problem aims at
solving:
θ
∗
=argmin
θ
R
P,T
}}
E
P∼D
P
{
E
{
,
(6)
∈
Θ
R
P,T
}
where
E
is the expected cumulative regret of
π
on problem
P
and where
T
is the
(a-priori given) time playing horizon. Solving this minimization problem is non trivial
since it involves an expectation over an infinite number of problems. Furthermore, given
a problem
P
, computing
E
{
R
P,T
}
relies on the expected values of
T
k
(
T
)
,whichwe
cannot compute exactly in the general case. Therefore, we propose to approximate the
expected cumulative regret by the empirical mean regret over a finite set of training
problems
P
(1)
,...,P
(
N
)
{
D
P
:
from
N
Δ
(
π
θ
)
where
Δ
(
π
)=
1
N
θ
∗
=argmin
θ∈Θ
R
P
(
i
)
,T
,
(7)
i
=1
and where
R
π
P
(
i
)
,T
values are estimated performing a single trajectory of
π
θ
on problem
P
. Note that the number of training problems
N
will typically be large in order to make
the variance
Δ
(
)
reasonably small.
In order to instantiate this approach, two components have to be provided: the hy-
pothesis space
Π
Θ
and the optimization algorithm to solve Eq. 7. The next two sections
describe different instantiations of these components.
·
4
Numeric Parameterization
We now instantiate our meta-learning approach by considering E/E strategies that have
numerical parameters.
4.1
Policy Search Space
To define the parametric family of candidate policies
Π
Θ
, we use index functions ex-
pressed as linear combinations of history features. These index functions rely on an
history feature function
φ
:
d
, that describes the history
w.r.t.
a
given arm as a vector of scalar features. Given the function
φ
(
H×{
1
,
2
,...,K
}→
·
,
·
)
, index functions are
defined by
index
θ
(
H
t
,k
)=
θ,φ
(
H
t
,k
)
,