Meta-learning of Exploration/Exploitation Strategies:
The Multi-armed Bandit Case
Francis Maes, Louis Wehenkel, and Damien Ernst
University of Liege, Dept. of Electrical Engineering and Computer Science
Institut Montefiore, B28, B-4000, Liege, Belgium
{francis.maes,L.Wehenkel,dernst}@ulg.ac.be
Abstract.
The exploration/exploitation (E/E) dilemma arises naturally in many
subfields of Science. Multi-armed bandit problems formalize this dilemma in its
canonical form. Most current research in this field focuses on generic solutions
that can be applied to a wide range of problems. However, in practice, it is often
the case that a form of prior information is available about the specific class of
target problems. Prior knowledge is rarely used in current solutions due to the
lack of a systematic approach to incorporate it into the E/E strategy.
To address a specific class of E/E problems, we propose to proceed in three
steps: (i) model prior knowledge in the form of a probability distribution over the
target class of E/E problems; (ii) choose a large hypothesis space of candidate
E/E strategies; and (iii), solve an optimization problem to find a candidate E/E
strategy of maximal average performance over a sample of problems drawn from
the prior distribution.
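The three steps above can be sketched in code. The following is a minimal illustration under assumed simplifications that are ours, not the paper's: the prior in step (i) is taken to be two-armed Bernoulli problems with arm means drawn uniformly in [0, 1], and the candidate space in step (ii) is a small hypothetical family of ε-greedy strategies parameterized by a single number ε.

```python
import random

def sample_bernoulli_bandit(num_arms=2, rng=random):
    # Step (i): sample a problem from the prior distribution
    # (assumed here: arm means drawn uniformly in [0, 1]).
    return [rng.random() for _ in range(num_arms)]

def play(epsilon, arm_means, budget, rng=random):
    # Play one bandit problem with a candidate epsilon-greedy strategy
    # (step (ii): a hypothetical one-parameter hypothesis space).
    counts = [0] * len(arm_means)
    sums = [0.0] * len(arm_means)
    total = 0.0
    for _ in range(budget):
        if rng.random() < epsilon or min(counts) == 0:
            arm = rng.randrange(len(arm_means))  # explore
        else:
            arm = max(range(len(arm_means)),
                      key=lambda a: sums[a] / counts[a])  # exploit
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

def meta_learn(candidates, num_problems=200, budget=50, seed=0):
    # Step (iii): return the candidate with maximal average
    # performance over a sample of problems drawn from the prior.
    rng = random.Random(seed)
    problems = [sample_bernoulli_bandit(rng=rng) for _ in range(num_problems)]
    def avg(eps):
        return sum(play(eps, p, budget, rng) for p in problems) / num_problems
    return max(candidates, key=avg)

best = meta_learn([0.0, 0.05, 0.1, 0.2, 0.5, 1.0])
```

In the paper itself the hypothesis spaces are richer (numeric parameterizations and symbolic formulas) and the optimization is correspondingly more involved; this sketch only shows the shape of the meta-learning loop.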
We illustrate this meta-learning approach with two different hypothesis spaces:
one where E/E strategies are numerically parameterized and another where E/E
strategies are represented as small symbolic formulas. We propose appropriate
optimization algorithms for both cases. Our experiments, with two-armed
“Bernoulli” bandit problems and various playing budgets, show that the meta-
learnt E/E strategies outperform generic strategies from the literature (UCB1,
UCB1-TUNED, UCB-V, KL-UCB and εn-GREEDY); they also assess the ro-
bustness of the learnt E/E strategies, through tests carried out on arms whose
rewards follow a truncated Gaussian distribution.
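The generic baselines listed above are index-based strategies: at each step they play the arm maximizing a score computed from its empirical statistics. As a point of reference, UCB1 uses the index r̄_k + sqrt(2 ln t / t_k), where r̄_k is arm k's empirical mean reward, t_k its play count, and t the total number of plays so far. A small sketch of that index (the formula is the standard one from the bandit literature; the function name is ours):

```python
import math

def ucb1_index(mean_reward, num_plays, total_plays):
    # UCB1 index: empirical mean plus an exploration bonus that grows
    # with total plays and shrinks with this arm's own play count.
    return mean_reward + math.sqrt(2.0 * math.log(total_plays) / num_plays)
```

Strategies in the paper's symbolic hypothesis space are small formulas of this same kind, searched over rather than fixed in advance.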
Keywords: Exploration-exploitation dilemma, Prior knowledge, Multi-armed
bandit problems, Reinforcement learning.
1 Introduction
Exploration versus exploitation (E/E) dilemmas arise in many sub-fields of Science,
and in related fields such as artificial intelligence, finance, medicine and engineering.
In its simplest version, the multi-armed bandit problem formalizes this dilemma as
follows [1]: a gambler has T coins, and at each step he may choose one of K
slot machines (or arms) on which to spend one coin; he then earns some money (his
reward) depending on the response of the machine he selected. Each arm's response
is characterized by an unknown probability distribution that is constant over time. The
goal of the gambler is to collect the largest cumulated reward once he has exhausted
his coins (i.e.