erroneous prior assumptions, by testing the E/E strategies learnt for Bernoulli arms on
bandits with rewards following a truncated Gaussian distribution.
The ideas presented in this paper have their roots in two previously published papers. The idea of learning multi-armed bandit policies using global optimization and numerically parameterized index-based policies was first proposed in [7]. Searching for good multi-armed bandit policies in a formula space was first proposed in [8]. Compared to this previous work, we adopt here a unifying perspective, namely the learning of E/E strategies from prior knowledge. We also introduce an improved optimization procedure for formula search, based on the identification of equivalence classes and on a pure-exploration multi-armed bandit problem formalization.
This paper is structured as follows. We first formally define the multi-armed bandit problem and introduce index-based policies in Section 2. Section 3 formally states the E/E strategy learning problem. Section 4 and Section 5 present the numerical and symbolic instantiations of our learning approach, respectively. Section 6 reports on experimental
results. Finally, we conclude and present future research directions in Section 7.
2 Multi-armed Bandit Problem and Policies
We now formally describe the (discrete) multi-armed bandit problem and the class of
index-based policies.
2.1 The Multi-armed Bandit Problem
We denote by $i \in \{1, 2, \ldots, K\}$ the ($K \geq 2$) arms of the bandit problem, by $\nu_i$ the reward distribution of arm $i$, and by $\mu_i$ its expected value; $b_t$ is the arm played at round $t$, and $r_t \sim \nu_{b_t}$ is the obtained reward. $H_t = [b_1, r_1, b_2, r_2, \ldots, b_t, r_t]$ is a vector that gathers the history over the first $t$ plays, and we denote by $\mathcal{H}$ the set of all possible histories of any length $t$. An E/E strategy (or policy) $\pi : \mathcal{H} \to \{1, 2, \ldots, K\}$ is an algorithm that processes at play $t$ the vector $H_{t-1}$ to select the arm $b_t \in \{1, 2, \ldots, K\}$: $b_t = \pi(H_{t-1})$.
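To make these objects concrete, the following minimal Python sketch (not from the paper; the class and function names are illustrative assumptions) encodes a $K$-armed Bernoulli bandit and a trivial E/E strategy implementing the mapping $\pi : \mathcal{H} \to \{1, 2, \ldots, K\}$, with arms 0-indexed for convenience.

```python
import random

class BernoulliBandit:
    """K-armed bandit; arm i yields reward 1 with probability mu[i], 0 otherwise."""
    def __init__(self, mu):
        self.mu = list(mu)       # true expected rewards mu_1, ..., mu_K
        self.K = len(self.mu)

    def play(self, arm):
        """Draw the stochastic reward r_t ~ nu_{b_t} for the chosen arm (0-indexed)."""
        return 1.0 if random.random() < self.mu[arm] else 0.0

def uniform_random_policy(history, K):
    """A trivial E/E strategy pi: H -> {1,...,K}; it ignores the history H_{t-1}.
    `history` is the list [(b_1, r_1), ..., (b_{t-1}, r_{t-1})]."""
    return random.randrange(K)
```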
The regret of the policy $\pi$ after $T$ plays is defined by $R_T = T\mu^* - \sum_{t=1}^{T} r_t$, where $\mu^* = \max_k \mu_k$ refers to the expected reward of the optimal arm. The expected value of the regret represents the expected loss due to the fact that the policy does not always play the best machine. It can be written as:
$$E\{R_T\} = \sum_{k=1}^{K} E\{T_k(T)\}\,(\mu^* - \mu_k), \qquad (1)$$
where $T_k(T)$ denotes the number of times the policy has drawn arm $k$ on the first $T$ rounds.
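As an illustration (assuming the 0-indexed history representation of the sketch above), the hypothetical helpers below compute the realized regret $T\mu^* - \sum_t r_t$ of a finished run and the per-arm decomposition of Eq. (1) evaluated with the empirical draw counts $T_k(T)$; the two quantities coincide only in expectation.

```python
def realized_regret(history, mu):
    """Realized regret T*mu_star - sum_t r_t for history [(b_1, r_1), ..., (b_T, r_T)]."""
    T = len(history)
    mu_star = max(mu)
    return T * mu_star - sum(r for _, r in history)

def regret_decomposition(history, mu):
    """Plug the empirical draw counts T_k(T) into the decomposition of Eq. (1).
    Eq. (1) is a statement about expectations; this is its empirical counterpart."""
    mu_star = max(mu)
    counts = [0] * len(mu)
    for arm, _ in history:
        counts[arm] += 1
    return sum(counts[k] * (mu_star - mu[k]) for k in range(len(mu)))
```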
The multi-armed bandit problem aims at finding a policy $\pi$ that for a given $K$ minimizes the expected regret (or, in other words, maximizes the expected reward), ideally for any $T$ and any $\{\nu_i\}_{i=1}^{K}$.
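To connect this objective to simulation, one possible way to estimate the expected regret $E\{R_T\}$ of a candidate E/E strategy for a fixed $\{\nu_i\}_{i=1}^{K}$ is plain Monte Carlo, sketched below by reusing the hypothetical BernoulliBandit, uniform_random_policy and realized_regret helpers from the earlier sketches; this is only an evaluation loop, not the paper's learning procedure.

```python
def run_policy(policy, bandit, T):
    """Play T rounds with pi: (H_{t-1}, K) -> b_t and return the history H_T."""
    history = []
    for _ in range(T):
        arm = policy(history, bandit.K)
        history.append((arm, bandit.play(arm)))
    return history

def estimate_expected_regret(policy, mu, T, n_runs=1000):
    """Monte Carlo estimate of E{R_T} for Bernoulli arms with means mu."""
    total = 0.0
    for _ in range(n_runs):
        history = run_policy(policy, BernoulliBandit(mu), T)
        total += realized_regret(history, mu)
    return total / n_runs

# Example: expected regret of uniform random play over T = 100 rounds on 3 arms.
# print(estimate_expected_regret(uniform_random_policy, [0.2, 0.5, 0.8], T=100))
```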