Algorithm 1. Generic index-based discrete bandit policy
1: Given scoring function index : H × {1, 2, ..., K} → ℝ
2: for t = 1 to K do    ▷ Initialization: play each bandit once
3:     Play bandit b_t = t
4:     Observe reward r_t
5: end for
6: for t = K + 1 to T do
7:     Play bandit b_t = argmax_{k ∈ {1, 2, ..., K}} index(H^k_{t−1}, t)
8:     Observe reward r_t
9: end for
2.2 Index-Based Bandit Policies
Index-based bandit policies are based on a ranking index that computes, for each arm k, a numerical score from the sub-history of responses H^k_{t−1} gathered from that arm up to time t. These policies are sketched in Algorithm 1 and work as follows. During the first K plays, they play the machines 1, 2, ..., K sequentially to perform initialization. In all subsequent plays, they compute for every machine k the score index(H^k_{t−1}, t), which depends on the observed sub-history H^k_{t−1} of arm k and possibly on t. At each step t, the arm with the largest score is selected (ties are broken at random).
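To make this concrete, below is a minimal Python sketch of Algorithm 1. The names play_index_policy, index_fn and bandit are illustrative: index_fn(history_k, t) is assumed to return the score of one arm given its reward sub-history, and bandit(k) is assumed to return a sampled reward of arm k.

```python
import random

def play_index_policy(index_fn, bandit, K, T):
    """Generic index-based discrete bandit policy (sketch of Algorithm 1)."""
    histories = [[] for _ in range(K)]  # sub-history H^k of each arm
    rewards = []
    for t in range(1, K + 1):           # initialization: play each bandit once
        k = t - 1
        r = bandit(k)
        histories[k].append(r)
        rewards.append(r)
    for t in range(K + 1, T + 1):       # subsequent plays: maximize the index
        scores = [index_fn(histories[k], t) for k in range(K)]
        best = max(scores)
        # ties are broken at random
        k = random.choice([i for i, s in enumerate(scores) if s == best])
        r = bandit(k)
        histories[k].append(r)
        rewards.append(r)
    return rewards
```

Any of the index functions discussed below can be plugged in as index_fn.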
Here are some examples of popular index functions:
$$\text{index}_{\text{UCB1}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{\frac{C \ln t}{t_k}} \qquad (2)$$

$$\text{index}_{\text{UCB1-TUNED}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{\frac{\ln t}{t_k} \min\left(\frac{1}{4},\; \sigma_k^2 + \sqrt{\frac{2 \ln t}{t_k}}\right)} \qquad (3)$$

$$\text{index}_{\text{UCB1-NORMAL}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{16 \, \frac{t_k \, \sigma_k^2}{t_k - 1} \, \frac{\ln(t-1)}{t_k}} \qquad (4)$$

$$\text{index}_{\text{UCB-V}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{\frac{2 \sigma_k^2 \, \zeta \ln t}{t_k}} + c \, \frac{3 \zeta \ln t}{t_k} \qquad (5)$$
where $\bar{r}_k$ and $\sigma_k$ denote the mean and the standard deviation of the rewards obtained so far from arm k, and $t_k$ is the number of times it has been played.
Policies UCB1, UCB1-TUNED and UCB1-NORMAL¹ have been proposed by [4]. UCB1 has one parameter C > 0, whose typical value is 2. Policy UCB-V has been proposed by [5] and has two parameters ζ > 0 and c > 0. We refer the reader to [4, 5] for detailed explanations of these parameters. Note that these index functions are the sum of an exploitation term that favors arms with a high mean reward ($\bar{r}_k$) and an exploration term, typically an upper confidence term, that aims at playing arms so as to gather more information about their underlying reward distributions.
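As an illustration, the indices above can be computed directly from an arm's reward sub-history. The sketch below uses the typical value C = 2 for UCB1; the defaults ζ = 1 and c = 1 for UCB-V are illustrative placeholders, not values prescribed by [5].

```python
import math

def _mean_var(rewards):
    """Empirical mean r_k and variance sigma_k^2 of an arm's rewards."""
    t_k = len(rewards)
    mean = sum(rewards) / t_k
    var = sum(r * r for r in rewards) / t_k - mean ** 2
    return t_k, mean, var

def index_ucb1(rewards, t, C=2.0):
    """UCB1 index, Eq. (2)."""
    t_k, mean, _ = _mean_var(rewards)
    return mean + math.sqrt(C * math.log(t) / t_k)

def index_ucb1_tuned(rewards, t):
    """UCB1-TUNED index, Eq. (3)."""
    t_k, mean, var = _mean_var(rewards)
    v = var + math.sqrt(2 * math.log(t) / t_k)  # variance upper bound
    return mean + math.sqrt(math.log(t) / t_k * min(0.25, v))

def index_ucb1_normal(rewards, t):
    """UCB1-NORMAL index, Eq. (4); assumes t_k >= 2 and t >= 2."""
    t_k, mean, var = _mean_var(rewards)
    return mean + math.sqrt(16 * (t_k * var / (t_k - 1)) * (math.log(t - 1) / t_k))

def index_ucb_v(rewards, t, zeta=1.0, c=1.0):
    """UCB-V index, Eq. (5)."""
    t_k, mean, var = _mean_var(rewards)
    return (mean + math.sqrt(2 * var * zeta * math.log(t) / t_k)
            + c * 3 * zeta * math.log(t) / t_k)
```

Each function is visibly the sum of the exploitation term (the empirical mean) and an exploration term that shrinks as $t_k$ grows; any of them can be passed as index_fn to the play_index_policy sketch above.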
¹ Note that this index-based policy does not strictly fit Algorithm 1, as it uses an additional condition to play bandits that have not been played for a long time.
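A minimal sketch of such a wrapped selection rule, assuming the ⌈8 ln t⌉ under-sampling threshold used by UCB1-NORMAL in [4]; select_arm_ucb1_normal is an illustrative name, and index_ucb1_normal is the helper sketched above:

```python
import math

def select_arm_ucb1_normal(histories, t):
    """UCB1-NORMAL arm selection: first play any arm played fewer than
    ceil(8 * ln t) times, otherwise maximize the index of Eq. (4)."""
    for k, h in enumerate(histories):
        if len(h) < math.ceil(8 * math.log(t)):
            return k
    return max(range(len(histories)),
               key=lambda k: index_ucb1_normal(histories[k], t))
```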
 