In our opinion, several research directions could be investigated to further improve the policy-learning algorithm proposed in this paper. For example, we found that problems similar to the overfitting encountered in supervised learning can occur when the set of candidate policies is too large. This naturally calls for studying whether our learning approach could be combined with regularization techniques. Along the same lines, more sophisticated optimizers could also be considered for identifying, within the set of candidate policies, the one that is predicted to behave best.
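To make this idea concrete, the following sketch (illustrative Python, not part of the algorithm described in this paper) selects a policy from a candidate set by its penalized training performance. The `play` method, the `complexity` attribute and the penalty weight `lam` are hypothetical placeholders for whatever simulation interface and regularizer one would actually use.

```python
def select_policy(candidates, training_problems, horizon, lam=0.0):
    """Pick the candidate whose mean training regret, plus a complexity
    penalty acting as a crude regularizer, is smallest.

    Assumed (hypothetical) interface: policy.play(problem, horizon) returns
    the empirical regret of one simulated run, and policy.complexity is a
    scalar measuring the size of the policy's formula or parameterization.
    """
    best_policy, best_score = None, float("inf")
    for policy in candidates:
        mean_regret = sum(policy.play(p, horizon) for p in training_problems) \
            / len(training_problems)
        score = mean_regret + lam * policy.complexity  # regularized objective
        if score < best_score:
            best_policy, best_score = policy, score
    return best_policy
```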
The UCB1, UCB2, UCB-V, KL-UCB and ε_n-GREEDY policies used for comparison were shown (under certain conditions) to have interesting bounds on their expected regret in asymptotic conditions (very large T), whereas we did not aim at providing such bounds for our learned policies. It would certainly be relevant to investigate whether similar bounds could be derived for our learned policies or, alternatively, to see how the approach could be adapted so as to target policies offering such theoretical performance guarantees in asymptotic conditions. For example, better bounds on the expected regret could perhaps be obtained by identifying, in a set of candidate policies, the one that gives the smallest maximal value of the expected regret over this set, rather than the one that gives the best average performance.
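As a rough illustration of the difference between these two selection criteria, the sketch below (again illustrative Python, not the procedure used in this paper) contrasts picking the candidate with the smallest worst-case estimated regret against picking the one with the smallest average estimated regret. Here `expected_regret` stands for any estimator of the expected regret of a policy on a problem, e.g. obtained by Monte Carlo simulation.

```python
def select_minimax(candidates, problems, horizon, expected_regret):
    """Return the policy whose largest estimated regret over `problems` is smallest."""
    return min(candidates,
               key=lambda pol: max(expected_regret(pol, pb, horizon)
                                   for pb in problems))


def select_average(candidates, problems, horizon, expected_regret):
    """Return the policy with the smallest mean estimated regret over `problems`."""
    return min(candidates,
               key=lambda pol: sum(expected_regret(pol, pb, horizon)
                                   for pb in problems) / len(problems))
```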
Finally, while our paper has provided simulation results in the context of the simplest multi-armed bandit setting, our exploration/exploitation policy meta-learning scheme can in principle also be applied to any other exploration/exploitation problem. In this line of research, the extension of this investigation to (finite) Markov Decision Processes studied in [15] already suggests that our approach to meta-learning E/E strategies can be successful in much more complex settings.
References
1. Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of The American
Mathematical Society 58, 527-536 (1952)
2. Lai, T., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied
Mathematics 6, 4-22 (1985)
3. Agrawal, R.: Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27, 1054-1078 (1995)
4. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multi-armed bandit problem. Machine Learning 47, 235-256 (2002)
5. Audibert, J.-Y., Munos, R., Szepesvári, C.: Tuning Bandit Algorithms in Stochastic Environments. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 150-165. Springer, Heidelberg (2007)
6. Audibert, J.-Y., Munos, R., Szepesvári, C.: Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science (2008)
7. Maes, F., Wehenkel, L., Ernst, D.: Learning to play K-armed bandit problems. In: Proc. of
the 4th International Conference on Agents and Artificial Intelligence (2012)
8. Maes, F., Wehenkel, L., Ernst, D.: Automatic Discovery of Ranking Formulas for Playing
with Multi-armed Bandits. In: Sanner, S., Hutter, M. (eds.) EWRL 2011. LNCS, vol. 7188,
pp. 5-17. Springer, Heidelberg (2012)
9. González, C., Lozano, J., Larrañaga, P.: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers (2002)
 