Meta-learning of Exploration/Exploitation Strategies: The Multi-armed Bandit Case - Agents and Artificial Intelligence

Information Technology Reference

In-Depth Information

Learning Exploration/Exploitation Strategies

Instead of relying on a fixed E/E strategy to solve a given class of problems, we pro-

pose a systematic approach to exploit prior knowledge by learning E/E strategies in a

problem-driven way. We now state our learning approach in abstract terms.

Prior knowledge is represented as a distribution

D P over bandit problems P =

( ν 1 ,...,ν K ) . From this distribution, we can sample as many training problems as de-

sired. In order to learn E/E strategies exploiting this knowledge, we rely on a parametric

family of candidate strategies Π Θ ⊂{

} H whose members are policies π θ

1 , 2 ,...,K

that are fully defined given parameters θ

∈

Θ .Given Π Θ , the learning problem aims at

solving:

θ ∗ =argmin

R P,T }}

E P∼D P {

{

(6)

∈

R P,T }

where E

is the expected cumulative regret of π on problem P and where T is the

(a-priori given) time playing horizon. Solving this minimization problem is non trivial

since it involves an expectation over an infinite number of problems. Furthermore, given

a problem P , computing E

{

R P,T }

relies on the expected values of T k ( T ) ,whichwe

cannot compute exactly in the general case. Therefore, we propose to approximate the

expected cumulative regret by the empirical mean regret over a finite set of training

problems P (1) ,...,P ( N )

{

D P :

from

Δ ( π θ ) where Δ ( π )= 1

θ ∗ =argmin

θ∈Θ

R P ( i ) ,T ,

(7)

i =1

and where R π P ( i ) ,T values are estimated performing a single trajectory of π θ on problem

P . Note that the number of training problems N will typically be large in order to make

the variance Δ (

) reasonably small.

In order to instantiate this approach, two components have to be provided: the hy-

pothesis space Π Θ and the optimization algorithm to solve Eq. 7. The next two sections

describe different instantiations of these components.

Numeric Parameterization

We now instantiate our meta-learning approach by considering E/E strategies that have

numerical parameters.

4.1

Policy Search Space

To define the parametric family of candidate policies Π Θ , we use index functions ex-

pressed as linear combinations of history features. These index functions rely on an

history feature function φ :

d , that describes the history w.r.t. a

given arm as a vector of scalar features. Given the function φ (

H×{

1 , 2 ,...,K

}→

) , index functions are

defined by

index θ ( H t ,k )=

θ,φ ( H t ,k )

Agents and Artificial Intelligence

Search WWH ::

Custom Search

Home