each arm uniformly in the range [0, 1]. Rewards are then sampled using a rejection sampling approach: samples are drawn from the corresponding Gaussian distribution until a value inside the interval [0, 1] is obtained.
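As an illustration, the following minimal Python sketch reproduces this reward-generation scheme; the function name and the use of NumPy are our own choices and not part of the original setup.

    import numpy as np

    def sample_truncated_reward(mu, sigma, rng):
        # Rejection sampling: redraw from the Gaussian until the sample falls in [0, 1].
        while True:
            r = rng.normal(mu, sigma)
            if 0.0 <= r <= 1.0:
                return r

    rng = np.random.default_rng(0)
    # Example arm: mean and standard deviation both drawn uniformly in [0, 1].
    mu, sigma = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
    reward = sample_truncated_reward(mu, sigma, rng)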
Generic Policies. We consider the following generic policies: the εn-GREEDY policy as described in [4], the policies UCB1, UCB1-TUNED, UCB1-NORMAL and UCB2 also introduced in [4], the policy KL-UCB introduced in [13] and the policy UCB-V proposed by [5]. Except εn-GREEDY, all these policies belong to the family of index-based policies discussed previously. UCB1-TUNED and UCB1-NORMAL are parameter-free policies designed for bandit problems with Bernoulli distributions and with Gaussian distributions, respectively. All the other policies have hyper-parameters that can be tuned to improve the quality of the policy: εn-GREEDY has two parameters c > 0 and 0 < d < 1, UCB2 has one parameter 0 < α < 1, KL-UCB has one parameter c ≥ 0 and UCB-V has two parameters ζ > 0 and c > 0. We refer the reader to [4, 5, 13] for detailed explanations of these parameters.
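For concreteness, the sketch below shows the index computation of UCB1 from [4], which is representative of the index-based policies listed above; the variable names are ours, and the other index-based policies differ only in the exploration term they add to the empirical mean.

    import math

    def ucb1_index(empirical_mean, n_k, t):
        # UCB1 index of an arm: empirical mean reward plus an exploration bonus
        # that shrinks as the arm is played more often (n_k plays out of t total plays).
        return empirical_mean + math.sqrt(2.0 * math.log(t) / n_k)

    def select_arm(means, counts, t):
        # Play each arm once, then always pick the arm with the largest index.
        for k, n_k in enumerate(counts):
            if n_k == 0:
                return k
        return max(range(len(means)), key=lambda k: ucb1_index(means[k], counts[k], t))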
Learning Numerical Policies. We learn policies using the two parameterizations POWER-1 and POWER-2 described in Section 4.1. Note that tuning generic policies is a particular case of learning with numerical parameters, and that both learned policies and tuned generic policies make use of the same prior knowledge. To make the comparison between these two kinds of policies fair, we always use the same training procedure, namely Algorithm 2 with i_max = 100 iterations, n_p = max(8d, 40) candidate policies per iteration and b = n_p / 4 best elements, where d is the number of parameters to optimize. Having a linear dependency between n_p and d is a classical choice when using EDAs [14]. Note that, in most cases, the optimization converges within a few to a few tens of iterations. Our simulations have shown that i_max = 100 is a conservative choice that gives the optimization enough time to converge properly. For the baseline policies for which default values are advocated, we use these values as the initial means of the EDA Gaussians. Otherwise, the initial Gaussians are centered on zero. Nothing is done to force the EDA to respect the constraints on the parameters (e.g., c > 0 and 0 < d < 1 for εn-GREEDY). In practice, the EDA automatically identifies interesting regions of the search space that respect these constraints.
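As an illustration of this training procedure, here is a hedged sketch of a plain Gaussian EDA loop with the settings given above; the objective evaluate_policy (returning an estimate of the expected regret of a parameter vector, lower being better) is a placeholder, and Algorithm 2 in the paper should be taken as the authoritative description.

    import numpy as np

    def gaussian_eda(evaluate_policy, d, init_mean=None, i_max=100, seed=0):
        rng = np.random.default_rng(seed)
        n_p = max(8 * d, 40)                 # candidate policies per iteration
        b = n_p // 4                         # number of best candidates kept
        mean = np.zeros(d) if init_mean is None else np.asarray(init_mean, dtype=float)
        std = np.ones(d)
        best_theta, best_score = mean.copy(), evaluate_policy(mean)
        for _ in range(i_max):
            # Sample candidates from the current Gaussian; no parameter constraints are enforced.
            candidates = rng.normal(mean, std, size=(n_p, d))
            scores = np.array([evaluate_policy(theta) for theta in candidates])
            elite = candidates[np.argsort(scores)[:b]]
            mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-9
            if scores.min() < best_score:
                best_score, best_theta = scores.min(), candidates[np.argmin(scores)].copy()
        return best_theta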
Learning Symbolic Policies. We apply our symbolic learning approach with a maximal formula length of L = 7, which leads to a set of about 33.5 million formulas. We have applied the approximate partitioning approach described in Section 5.2 to these formulas, using d = 1024 samples to discriminate among strategies. This has resulted in |Θ| ≈ 9.5 million invalid formulas and M = 99020 distinct candidate E/E strategies (i.e., distinct formula equivalence classes). To identify the best of those distinct strategies, we apply the UCB1-TUNED algorithm for 10^7 steps. In our experiments, we report the two best found policies, which we denote FORMULA-1 and FORMULA-2.
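The partitioning step can be sketched as follows: each formula is evaluated on the d sampled inputs, formulas that produce invalid values are discarded, and the remaining formulas are grouped by the vector of values they produce. The formula representation and the eval_formula helper below are hypothetical; only one representative per equivalence class then needs to be evaluated by the bandit algorithm (here, UCB1-TUNED).

    import math
    from collections import defaultdict

    def partition_formulas(formulas, eval_formula, sample_points):
        # eval_formula(f, x) is assumed to return the value of formula f on input x,
        # raising an exception or returning NaN/inf when the formula is ill-defined.
        classes = defaultdict(list)          # value signature -> equivalent formulas
        invalid = []
        for f in formulas:
            try:
                values = [eval_formula(f, x) for x in sample_points]
            except (ArithmeticError, ValueError):
                invalid.append(f)
                continue
            if any(math.isnan(v) or math.isinf(v) for v in values):
                invalid.append(f)
            else:
                # Formulas sharing the same rounded value vector count as one E/E strategy.
                classes[tuple(round(v, 12) for v in values)].append(f)
        return classes, invalid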
6.2 Performance Comparison
Table 1 reports the results we obtain for untuned generic policies, tuned generic policies and learned policies on the two distributions of bandit problems described above, with horizons T ∈ {10, 100, 1000}. For both tuned and learned policies, we consider three different training horizons {10, 100, 1000} to see the effect of a mismatch between the training and the testing horizon.
 