Algorithm 1. Generic index-based discrete bandit policy

1: Given scoring function index : H × {1, 2, ..., K} → ℝ
2: for t = 1 to K do
3:   Play bandit b_t = t      {Initialization: play each bandit once}
4:   Observe reward r_t
5: end for
6: for t = K + 1 to T do
7:   Play bandit b_t = argmax_{k ∈ {1,2,...,K}} index(H^k_{t−1}, t)
8:   Observe reward r_t
9: end for
2.2 Index-Based Bandit Policies

Index-based bandit policies are based on a ranking index that computes for each arm k a numerical value from the sub-history of responses H^k_{t−1} of that arm gathered up to time t. These policies are sketched in Algorithm 1 and work as follows. During the first K plays, they sequentially play the machines 1, 2, ..., K to perform initialization. In all subsequent plays, these policies compute for every machine k the score index(H^k_{t−1}, t), which depends on the observed sub-history H^k_{t−1} of arm k and possibly on t. At each step t, the arm with the largest score is selected (ties are broken at random).
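As an illustrative sketch (not code from the original text), Algorithm 1 can be written in Python; the `index` and `sample_reward` callables are assumptions standing in for a concrete scoring function and the unknown reward distributions:

```python
import random

def play_bandit_policy(index, sample_reward, K, T):
    """Generic index-based discrete bandit policy (Algorithm 1).

    index(history_k, k, t) -> float scores arm k from its sub-history;
    sample_reward(k) draws one reward from arm k's (unknown) distribution.
    """
    history = {k: [] for k in range(K)}   # sub-history H^k of each arm
    # Steps 2-5: initialization, play each bandit once
    for t in range(1, K + 1):
        k = t - 1
        history[k].append(sample_reward(k))
    # Steps 6-9: always play the arm with the largest index score
    for t in range(K + 1, T + 1):
        scores = [index(history[k], k, t) for k in range(K)]
        best = max(scores)
        # ties are broken at random, as in the text
        k = random.choice([k for k, s in enumerate(scores) if s == best])
        history[k].append(sample_reward(k))
    return history
```

With a purely greedy index (the empirical mean, no exploration term) the policy locks onto whichever arm looked best after initialization, which is exactly the failure mode the exploration terms below are designed to avoid.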
Here are some examples of popular index functions:

$\text{index}_{\text{UCB1}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{\frac{C \ln t}{t_k}}$  (2)

$\text{index}_{\text{UCB1-TUNED}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{\frac{\ln t}{t_k} \min\left(\frac{1}{4},\; \sigma_k^2 + \sqrt{\frac{2 \ln t}{t_k}}\right)}$  (3)

$\text{index}_{\text{UCB1-NORMAL}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{16\, \frac{\sigma_k^2\, t_k}{t_k - 1}\, \frac{\ln(t-1)}{t_k}}$  (4)

$\text{index}_{\text{UCB-V}}(H^k_{t-1}, t) = \bar{r}_k + \sqrt{\frac{2 \sigma_k^2\, \zeta \ln t}{t_k}} + c\, \frac{3 \zeta \ln t}{t_k}$  (5)
where $\bar{r}_k$ and $\sigma_k$ are the mean and standard deviation of the rewards so far obtained from arm k, and $t_k$ is the number of times it has been played.
Policies UCB1, UCB1-TUNED and UCB1-NORMAL¹ have been proposed by [4]. UCB1 has one parameter C > 0 whose typical value is 2. Policy UCB-V has been proposed by [5] and has two parameters ζ > 0 and c > 0. We refer the reader to [4, 5] for detailed explanations of these parameters. Note that these index functions are the sum of an exploitation term that gives preference to arms with a high mean reward ($\bar{r}_k$) and an exploration term, typically an upper confidence term, that aims at playing arms to gather more information on their underlying reward distribution.
¹ Note that this index-based policy does not strictly fit inside Algorithm 1, as it uses an additional condition to play bandits that have not been played for a long time.
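As a small end-to-end sketch, running the UCB1 index of Eq. (2) inside the loop of Algorithm 1 on three Bernoulli arms (the arm probabilities 0.2, 0.5 and 0.8 are made up for illustration) shows the plays concentrating on the best arm:

```python
import random
from math import log, sqrt

random.seed(1)
probs = [0.2, 0.5, 0.8]        # hypothetical Bernoulli arm means
K, T, C = len(probs), 2000, 2.0
counts = [0] * K               # t_k: plays per arm
sums = [0.0] * K               # cumulative reward per arm

def pull(k):
    r = 1.0 if random.random() < probs[k] else 0.0
    counts[k] += 1
    sums[k] += r

for k in range(K):             # initialization: play each arm once
    pull(k)
for t in range(K + 1, T + 1):  # UCB1 index, Eq. (2)
    scores = [sums[k] / counts[k] + sqrt(C * log(t) / counts[k])
              for k in range(K)]
    pull(max(range(K), key=lambda k: scores[k]))

print(counts)                  # plays per arm; the 0.8 arm typically dominates
```

The suboptimal arms still receive a logarithmically growing number of plays, which is the price the upper confidence term pays for ruling them out reliably.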