(optimistic) term that decreases with $t_k$. However, for the shortest time horizon ($T = 10$), the policy found, $t_k(\bar{r}_k - \frac{1}{2})$, is totally different from UCB-type policies. With such a policy, only the arms whose empirical reward mean is higher than a given threshold (0.5) have positive index scores and are candidates for selection, i.e. making the scores negative has the effect of killing bad arms. If the $\bar{r}_k$ of an arm is above the threshold, then the index associated with this arm increases with the number of times it is played, rather than decreasing as is the case for UCB policies. If all empirical means $\bar{r}_k$ are below the threshold, then, for equal reward means, arms that have been played less are preferred. This finding is striking, since it suggests that the optimistic paradigm upon which UCB policies are based may in fact not be suited at all to contexts where the horizon is small.
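To make the behaviour of this learned index concrete, the following Python sketch (not the C++ code used in the experiments) contrasts it with classical UCB1 on a two-armed Bernoulli problem; the arm means, the seed and the simulation loop are illustrative assumptions only.

    import math, random

    def learned_index(mean_k, t_k):
        # Learned short-horizon index t_k * (mean_k - 0.5):
        # negative (arm "killed") when the empirical mean is below 0.5,
        # and growing with the play count t_k when the arm looks good.
        return t_k * (mean_k - 0.5)

    def ucb1_index(mean_k, t_k, t):
        # Classical UCB1 index: optimistic bonus that shrinks with t_k.
        return mean_k + math.sqrt(2.0 * math.log(t) / t_k)

    def run(index_fn, probs, horizon, seed=0):
        rng = random.Random(seed)
        counts = [0] * len(probs)    # t_k: number of plays of each arm
        sums = [0.0] * len(probs)    # cumulative reward of each arm
        for t in range(1, horizon + 1):
            if t <= len(probs):      # play every arm once to initialise
                k = t - 1
            else:
                k = max(range(len(probs)),
                        key=lambda a: index_fn(sums[a] / counts[a], counts[a], t))
            reward = 1.0 if rng.random() < probs[k] else 0.0
            counts[k] += 1
            sums[k] += reward
        return sum(sums)

    # Illustrative two-armed Bernoulli problem with horizon T = 10.
    probs, T = [0.7, 0.3], 10
    print(run(lambda m, n, t: learned_index(m, n), probs, T))
    print(run(ucb1_index, probs, T))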
Percentage of Wins Against UCB1-TUNED. Table 2 gives, for each policy trained with the same horizon as the test horizon, its percentage of wins against UCB1-TUNED. To compute this percentage of wins, we evaluate the expected regret on each of the 10000 testing problems and count the number of problems for which the tested policy outperforms UCB1-TUNED. We observe that, by minimizing the expected regret, our learned policies also reach high percentages of wins: 84.6% for $T = 100$ and 91.3% for $T = 1000$. Note that, in our approach, it is easy to change the objective function. So, if the real application aim were to maximize the percentage of wins against UCB1-TUNED, this criterion could have been used directly in the policy optimization stage to reach even better scores.
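As an illustration of how this criterion is computed, here is a minimal Python sketch; estimate_regret is a hypothetical helper standing in for whatever expected-regret estimator is used, and the function simply counts the fraction of test problems on which the learned policy beats the baseline.

    def percentage_of_wins(learned_policy, baseline_policy, test_problems,
                           horizon, estimate_regret):
        # Count the test problems on which the learned policy achieves a
        # strictly lower expected regret than the baseline (UCB1-TUNED here).
        wins = 0
        for problem in test_problems:
            if (estimate_regret(learned_policy, problem, horizon)
                    < estimate_regret(baseline_policy, problem, horizon)):
                wins += 1
        return 100.0 * wins / len(test_problems)

The same quantity could also be plugged in directly as the objective of the policy optimization stage, which is the alternative alluded to above.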
6.3 Computational Time
We used a C++ based implementation to perform our experiments. In the numerical case, with 10 cores at 1.9 GHz, performing the whole learning of POWER-1 took one hour for $T = 100$ and ten hours for $T = 1000$. In the symbolic case, using a single core at 1.9 GHz, performing the whole learning took 22 minutes for $T = 100$ and a bit less than three hours for $T = 1000$. The fact that symbolic learning is much faster can be explained by two factors. First, we tuned the EDA algorithm very carefully to be sure to find a high-quality solution; we observe that, by using only 10% of this learning time, we already obtain close-to-optimal strategies. The second factor is that our symbolic learning algorithm saves a lot of CPU time by rapidly rejecting bad strategies, thanks to the multi-armed bandit formulation upon which it relies.
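The rejection mechanism itself is not detailed in this excerpt; purely as an illustration of the general idea, the sketch below treats each candidate formula as the arm of a meta-bandit and allocates evaluation episodes with a simple UCB1-style rule, so that clearly bad candidates quickly stop receiving budget. evaluate_once is a hypothetical one-episode performance measure.

    import math

    def select_best_candidate(candidates, evaluate_once, budget):
        # Treat each candidate index formula as an arm of a meta-bandit and
        # spend the evaluation budget preferentially on promising candidates;
        # poorly performing formulas quickly stop receiving episodes.
        counts = [0] * len(candidates)
        sums = [0.0] * len(candidates)
        for t in range(1, budget + 1):
            if t <= len(candidates):      # evaluate every candidate once
                i = t - 1
            else:                         # then follow a UCB1-style rule
                i = max(range(len(candidates)),
                        key=lambda j: sums[j] / counts[j]
                        + math.sqrt(2.0 * math.log(t) / counts[j]))
            sums[i] += evaluate_once(candidates[i])
            counts[i] += 1
        best = max(range(len(candidates)), key=lambda j: sums[j] / counts[j])
        return candidates[best]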
7 Conclusions
The approach proposed in this paper for exploiting prior knowledge to learn exploration/exploitation policies has been tested on two-armed bandit problems with Bernoulli reward distributions and a known time horizon. The learned policies were found to significantly outperform policies previously published in the literature, such as UCB1, UCB2, UCB-V, KL-UCB and $\epsilon_n$-GREEDY. The robustness of the learned policies with respect to wrong prior information was also highlighted, by evaluating them on two-armed bandits with truncated Gaussian reward distributions.