(optimistic) term that decreases with $t_k$. However, for the shortest time horizon ($T = 10$), the policy found, $t_k(\bar{r}_k - \frac{1}{2})$, is totally different from UCB-type policies. With such a policy, only the arms whose empirical reward mean is higher than a given threshold (0.5) have positive index scores and are candidates for selection, i.e. making the scores negative has the effect of killing bad arms. If the $\bar{r}_k$ of an arm is above the threshold, then the index associated with this arm increases with the number of times it is played, rather than decreasing as is the case for UCB policies. If all empirical means $\bar{r}_k$ are below the threshold, then, for equal reward means, arms that have been played less are preferred. This finding is striking, since it suggests that the optimistic paradigm upon which UCB policies are based may in fact not be suited at all to contexts where the horizon is small.
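To make the behaviour of this learned index concrete, the following Python sketch (not the C++ code used in the experiments) contrasts it with classical UCB1 on a two-armed Bernoulli problem; the arm means, the seed and the simulation loop are illustrative assumptions only.

    import math, random

    def learned_index(mean_k, t_k):
        # Learned short-horizon index t_k * (mean_k - 0.5):
        # negative (arm "killed") when the empirical mean is below 0.5,
        # and growing with the play count t_k when the arm looks good.
        return t_k * (mean_k - 0.5)

    def ucb1_index(mean_k, t_k, t):
        # Classical UCB1 index: optimistic bonus that shrinks with t_k.
        return mean_k + math.sqrt(2.0 * math.log(t) / t_k)

    def run(index_fn, probs, horizon, seed=0):
        rng = random.Random(seed)
        counts = [0] * len(probs)    # t_k: number of plays of each arm
        sums = [0.0] * len(probs)    # cumulative reward of each arm
        for t in range(1, horizon + 1):
            if t <= len(probs):      # play every arm once to initialise
                k = t - 1
            else:
                k = max(range(len(probs)),
                        key=lambda a: index_fn(sums[a] / counts[a], counts[a], t))
            reward = 1.0 if rng.random() < probs[k] else 0.0
            counts[k] += 1
            sums[k] += reward
        return sum(sums)

    # Illustrative two-armed Bernoulli problem with horizon T = 10.
    probs, T = [0.7, 0.3], 10
    print(run(lambda m, n, t: learned_index(m, n), probs, T))
    print(run(ucb1_index, probs, T))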
Percentage of Wins Against UCB1-TUNED. Table 2 gives, for each policy trained with the same horizon as the test horizon, its percentage of wins against UCB1-TUNED. To compute this percentage of wins, we evaluate the expected regret on each of the 10000 testing problems and count the number of problems for which the tested policy outperforms UCB1-TUNED. We observe that, by minimizing the expected regret, our learned policies also reach high percentages of wins: 84.6% for $T = 100$ and 91.3% for $T = 1000$. Note that, in our approach, it is easy to change the objective function. So, if the real application aim were to maximize the percentage of wins against UCB1-TUNED, this criterion could have been used directly in the policy optimization stage to reach even better scores.
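As an illustration of how this criterion is computed, here is a minimal Python sketch; estimate_regret is a hypothetical helper standing in for whatever expected-regret estimator is used, and the function simply counts the fraction of test problems on which the learned policy beats the baseline.

    def percentage_of_wins(learned_policy, baseline_policy, test_problems,
                           horizon, estimate_regret):
        # Count the test problems on which the learned policy achieves a
        # strictly lower expected regret than the baseline (UCB1-TUNED here).
        wins = 0
        for problem in test_problems:
            if (estimate_regret(learned_policy, problem, horizon)
                    < estimate_regret(baseline_policy, problem, horizon)):
                wins += 1
        return 100.0 * wins / len(test_problems)

The same quantity could also be plugged in directly as the objective of the policy optimization stage, which is the alternative alluded to above.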
6.3 Computational Time
We used a C++ based implementation to perform our experiments. In the numerical case, with 10 cores at 1.9 GHz, performing the whole learning of POWER-1 took one hour for $T = 100$ and ten hours for $T = 1000$. In the symbolic case, using a single core at 1.9 GHz, performing the whole learning took 22 minutes for $T = 100$ and a bit less than three hours for $T = 1000$. The fact that symbolic learning is much faster can be explained by two factors. First, we tuned the EDA algorithm very carefully to be sure to find a high-quality solution; we observe that, by using only 10% of this learning time, we already obtain close-to-optimal strategies. The second factor is that our symbolic learning algorithm saves a lot of CPU time by rapidly rejecting bad strategies, thanks to the multi-armed bandit formulation upon which it relies.
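The rejection mechanism itself is not detailed in this excerpt; purely as an illustration of the general idea, the sketch below treats each candidate formula as the arm of a meta-bandit and allocates evaluation episodes with a simple UCB1-style rule, so that clearly bad candidates quickly stop receiving budget. evaluate_once is a hypothetical one-episode performance measure.

    import math

    def select_best_candidate(candidates, evaluate_once, budget):
        # Treat each candidate index formula as an arm of a meta-bandit and
        # spend the evaluation budget preferentially on promising candidates;
        # poorly performing formulas quickly stop receiving episodes.
        counts = [0] * len(candidates)
        sums = [0.0] * len(candidates)
        for t in range(1, budget + 1):
            if t <= len(candidates):      # evaluate every candidate once
                i = t - 1
            else:                         # then follow a UCB1-style rule
                i = max(range(len(candidates)),
                        key=lambda j: sums[j] / counts[j]
                        + math.sqrt(2.0 * math.log(t) / counts[j]))
            sums[i] += evaluate_once(candidates[i])
            counts[i] += 1
        best = max(range(len(candidates)), key=lambda j: sums[j] / counts[j])
        return candidates[best]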
7 Conclusions
The approach proposed in this paper for exploiting prior knowledge to learn exploration/exploitation policies has been tested on two-armed bandit problems with Bernoulli reward distributions and a known time horizon. The learned policies were found to significantly outperform policies previously published in the literature, such as UCB1, UCB2, UCB-V, KL-UCB and $\epsilon_n$-GREEDY. The robustness of the learned policies with respect to wrong prior information was also highlighted, by evaluating them on two-armed bandits with truncated Gaussian reward distributions.