Meta-learning of Exploration/Exploitation Strategies: The Multi-armed Bandit Case - Agents and Artificial Intelligence

Information Technology Reference

In-Depth Information

Algorithm 2 . EDA-based learning of a discrete bandit policy

Given the number of iterations i max ,

Given the population size n p ,

Given the number of best elements b ,

Given a sample of training bandit problems P (1) ,...,P ( N ) ,

Given an history-features function φ ( ·, · ) ∈

d ,

1: Set μ p =0 ,σ p =1 , ∀p ∈ [1 ,d ]

Initialize with normal Gaussians

2: for i ∈ [1 ,i max ] do

for j ∈ [1 ,n p ] do

Sample and evaluate new population

for p ∈ [1 ,d ] do

N ( μ p ,σ p )

θ p ←

sample from

end for

Estimate Δ ( π θ ) and store result ( θ,Δ ( π θ ))

end for

{θ (1) ,...,θ ( b ) }

Select

the b best candidate θ vectors w.r.t. their Δ ( · ) score

μ p ← b j =1 θ ( j p , ∀p ∈ [1 ,d ]

10:

Learn new Gaussians

σ p ← b j =1 ( θ ( j )

− μ p ) 2 , ∀p ∈ [1 ,d ]

11:

12: end for

13: return The policy π θ that led to the lowest observed value of Δ ( π θ )

( μ p ,σ p ) for each parameter θ p . Although this approach is

simple, it proved to be quite effective experimentally to solve Equation 7. The full de-

tails of our EDA-based policy learning procedure are given by Algorithm 2. The initial

distributions are standard Gaussian distributions

one Gaussian distribution

(0 , 1) . The policy that is returned

corresponds to the θ parameters that led to the lowest observed value of Δ ( π θ ) .

Symbolic Parametrization

The index functions from the literature depend on the current time step t andonthree

statistics extracted from the sub-history H t− 1 : r k , σ k and t k . We now propose a second

parameterization of our learning approach, in which we consider all index functions that

can be constructed using small formulas built upon these four variables.

5.1

Policy Search Space

We consider index functions that are given in the form of small, closed-form formulas.

Closed-form formulas have several advantages: they can be easily computed, they can

formally be analyzed and they are easily interpretable.

Let us first explicit the set of formulas

that we consider in this paper. A formula

is:

- either a binary expression F = B ( F ,F ) ,where B belongs to a set of binary

operators

∈ F

and F and F are also formulas from

- or a unary expression F = U ( F ) where U belongs to a set of unary operators

and F ∈ F

Agents and Artificial Intelligence

Search WWH ::

Custom Search

Home