In our opinion, several research directions could be investigated to further improve the policy-learning algorithm proposed in this paper. For example, we found that problems similar to the overfitting encountered in supervised learning can occur when the set of candidate policies is too large. This naturally calls for studying whether our learning approach could be combined with regularization techniques. Along the same lines, more sophisticated optimizers could also be considered for identifying, within the set of candidate policies, the one that is predicted to behave best.
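To make this idea concrete, the following sketch (illustrative Python, not part of the algorithm described in this paper) selects a policy from a candidate set by its penalized training performance. The `play` method, the `complexity` attribute and the penalty weight `lam` are hypothetical placeholders for whatever simulation interface and regularizer one would actually use.

```python
def select_policy(candidates, training_problems, horizon, lam=0.0):
    """Pick the candidate whose mean training regret, plus a complexity
    penalty acting as a crude regularizer, is smallest.

    Assumed (hypothetical) interface: policy.play(problem, horizon) returns
    the empirical regret of one simulated run, and policy.complexity is a
    scalar measuring the size of the policy's formula or parameterization.
    """
    best_policy, best_score = None, float("inf")
    for policy in candidates:
        mean_regret = sum(policy.play(p, horizon) for p in training_problems) \
            / len(training_problems)
        score = mean_regret + lam * policy.complexity  # regularized objective
        if score < best_score:
            best_policy, best_score = policy, score
    return best_policy
```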
The UCB1, UCB2, UCB-V, KL-UCB and ε_n-GREEDY policies used for comparison were shown (under certain conditions) to have interesting bounds on their expected regret in asymptotic conditions (very large T), whereas we did not aim at providing such bounds for our learned policies. It would certainly be relevant to investigate whether similar bounds could be derived for our learned policies or, alternatively, to see how the approach could be adapted so as to target policies offering such theoretical performance guarantees in asymptotic conditions. For example, better bounds on the expected regret could perhaps be obtained by identifying, in a set of candidate policies, the one that gives the smallest maximal value of the expected regret over this set, rather than the one that gives the best average performance.
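As a rough illustration of the difference between these two selection criteria, the sketch below (again illustrative Python, not the procedure used in this paper) contrasts picking the candidate with the smallest worst-case estimated regret against picking the one with the smallest average estimated regret. Here `expected_regret` stands for any estimator of the expected regret of a policy on a problem, e.g. obtained by Monte Carlo simulation.

```python
def select_minimax(candidates, problems, horizon, expected_regret):
    """Return the policy whose largest estimated regret over `problems` is smallest."""
    return min(candidates,
               key=lambda pol: max(expected_regret(pol, pb, horizon)
                                   for pb in problems))


def select_average(candidates, problems, horizon, expected_regret):
    """Return the policy with the smallest mean estimated regret over `problems`."""
    return min(candidates,
               key=lambda pol: sum(expected_regret(pol, pb, horizon)
                                   for pb in problems) / len(problems))
```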
Finally, while our paper has provided simulation results in the context of the simplest multi-armed bandit setting, our exploration/exploitation policy meta-learning scheme can in principle also be applied to any other exploration/exploitation problem. In this line of research, the extension of this investigation to (finite) Markov Decision Processes studied in [15] already suggests that our approach to meta-learning E/E strategies can be successful in much more complex settings.
References
1. Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of The American
Mathematical Society 58, 527-536 (1952)
2. Lai, T., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied
Mathematics 6, 4-22 (1985)
3. Agrawal, R.: Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27, 1054-1078 (1995)
4. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multi-armed bandit problem. Machine Learning 47, 235-256 (2002)
5. Audibert, J.-Y., Munos, R., Szepesvári, C.: Tuning Bandit Algorithms in Stochastic Environments. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 150-165. Springer, Heidelberg (2007)
6. Audibert, J.-Y., Munos, R., Szepesvári, C.: Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science (2008)
7. Maes, F., Wehenkel, L., Ernst, D.: Learning to play K-armed bandit problems. In: Proc. of
the 4th International Conference on Agents and Artificial Intelligence (2012)
8. Maes, F., Wehenkel, L., Ernst, D.: Automatic Discovery of Ranking Formulas for Playing
with Multi-armed Bandits. In: Sanner, S., Hutter, M. (eds.) EWRL 2011. LNCS, vol. 7188,
pp. 5-17. Springer, Heidelberg (2012)
9. González, C., Lozano, J., Larrañaga, P.: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers (2002)
 