Algorithm 1. Generic index-based discrete bandit policy
1: Given scoring function index : H × {1, 2, ..., K} → ℝ
2: for t = 1 to K do    ▷ Initialization: play each bandit once
3:     Play bandit b_t = t
4:     Observe reward r_t
5: end for
6: for t = K + 1 to T do
7:     Play bandit b_t = argmax_{k ∈ {1, 2, ..., K}} index(H^k_{t−1}, t)
8:     Observe reward r_t
9: end for
2.2 Index-Based Bandit Policies
Index-based bandit policies are based on a ranking index that computes, for each arm k, a numerical score from the sub-history of responses H^k_{t−1} gathered from that arm up to time t. These policies are sketched in Algorithm 1 and work as follows. During the first K plays, they play the machines 1, 2, ..., K sequentially to perform initialization. In all subsequent plays, they compute for every machine k the score index(H^k_{t−1}, t), which depends on the observed sub-history H^k_{t−1} of arm k and possibly on t. At each step t, the arm with the largest score is selected (ties are broken at random).
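To make this concrete, below is a minimal Python sketch of Algorithm 1. The names play_index_policy, index_fn and bandit are illustrative: index_fn(history_k, t) is assumed to return the score of one arm given its reward sub-history, and bandit(k) is assumed to return a sampled reward of arm k.

```python
import random

def play_index_policy(index_fn, bandit, K, T):
    """Generic index-based discrete bandit policy (sketch of Algorithm 1)."""
    histories = [[] for _ in range(K)]  # sub-history H^k of each arm
    rewards = []
    for t in range(1, K + 1):           # initialization: play each bandit once
        k = t - 1
        r = bandit(k)
        histories[k].append(r)
        rewards.append(r)
    for t in range(K + 1, T + 1):       # subsequent plays: maximize the index
        scores = [index_fn(histories[k], t) for k in range(K)]
        best = max(scores)
        # ties are broken at random
        k = random.choice([i for i, s in enumerate(scores) if s == best])
        r = bandit(k)
        histories[k].append(r)
        rewards.append(r)
    return rewards
```

Any of the index functions discussed below can be plugged in as index_fn.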
Here are some examples of popular index functions:
$$\text{index}_{\text{UCB1}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{\frac{C \ln t}{t_k}} \qquad (2)$$

$$\text{index}_{\text{UCB1-TUNED}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{\frac{\ln t}{t_k} \min\left(\frac{1}{4},\; \sigma_k^2 + \sqrt{\frac{2 \ln t}{t_k}}\right)} \qquad (3)$$

$$\text{index}_{\text{UCB1-NORMAL}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{16 \, \frac{t_k \, \sigma_k^2}{t_k - 1} \, \frac{\ln(t-1)}{t_k}} \qquad (4)$$

$$\text{index}_{\text{UCB-V}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{\frac{2 \sigma_k^2 \, \zeta \ln t}{t_k}} + c \, \frac{3 \zeta \ln t}{t_k} \qquad (5)$$
where $\bar{r}_k$ and $\sigma_k$ denote the mean and the standard deviation of the rewards obtained so far from arm k, and $t_k$ is the number of times it has been played.
Policies UCB1, UCB1-TUNED and UCB1-NORMAL¹ have been proposed by [4]. UCB1 has one parameter C > 0, whose typical value is 2. Policy UCB-V has been proposed by [5] and has two parameters ζ > 0 and c > 0. We refer the reader to [4, 5] for detailed explanations of these parameters. Note that these index functions are the sum of an exploitation term that favors arms with a high mean reward ($\bar{r}_k$) and an exploration term, typically an upper confidence term, that aims at playing arms so as to gather more information about their underlying reward distributions.
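As an illustration, the indices above can be computed directly from an arm's reward sub-history. The sketch below uses the typical value C = 2 for UCB1; the defaults ζ = 1 and c = 1 for UCB-V are illustrative placeholders, not values prescribed by [5].

```python
import math

def _mean_var(rewards):
    """Empirical mean r_k and variance sigma_k^2 of an arm's rewards."""
    t_k = len(rewards)
    mean = sum(rewards) / t_k
    var = sum(r * r for r in rewards) / t_k - mean ** 2
    return t_k, mean, var

def index_ucb1(rewards, t, C=2.0):
    """UCB1 index, Eq. (2)."""
    t_k, mean, _ = _mean_var(rewards)
    return mean + math.sqrt(C * math.log(t) / t_k)

def index_ucb1_tuned(rewards, t):
    """UCB1-TUNED index, Eq. (3)."""
    t_k, mean, var = _mean_var(rewards)
    v = var + math.sqrt(2 * math.log(t) / t_k)  # variance upper bound
    return mean + math.sqrt(math.log(t) / t_k * min(0.25, v))

def index_ucb1_normal(rewards, t):
    """UCB1-NORMAL index, Eq. (4); assumes t_k >= 2 and t >= 2."""
    t_k, mean, var = _mean_var(rewards)
    return mean + math.sqrt(16 * (t_k * var / (t_k - 1)) * (math.log(t - 1) / t_k))

def index_ucb_v(rewards, t, zeta=1.0, c=1.0):
    """UCB-V index, Eq. (5)."""
    t_k, mean, var = _mean_var(rewards)
    return (mean + math.sqrt(2 * var * zeta * math.log(t) / t_k)
            + c * 3 * zeta * math.log(t) / t_k)
```

Each function is visibly the sum of the exploitation term (the empirical mean) and an exploration term that shrinks as $t_k$ grows; any of them can be passed as index_fn to the play_index_policy sketch above.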
¹ Note that this index-based policy does not strictly fit Algorithm 1, as it uses an additional condition to play bandits that have not been played for a long time.
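A minimal sketch of such a wrapped selection rule, assuming the ⌈8 ln t⌉ under-sampling threshold used by UCB1-NORMAL in [4]; select_arm_ucb1_normal is an illustrative name, and index_ucb1_normal is the helper sketched above:

```python
import math

def select_arm_ucb1_normal(histories, t):
    """UCB1-NORMAL arm selection: first play any arm played fewer than
    ceil(8 * ln t) times, otherwise maximize the index of Eq. (4)."""
    for k, h in enumerate(histories):
        if len(h) < math.ceil(8 * math.log(t)):
            return k
    return max(range(len(histories)),
               key=lambda k: index_ucb1_normal(histories[k], t))
```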
 