after T plays). A rational (and risk-neutral) gambler knowing the reward distributions
of the K arms would play at every stage an arm with maximal expected reward, so as
to maximize his expected cumulative reward (irrespective of the number K of arms,
his number T of coins, and the variances of the reward distributions). When the reward
distributions are unknown, deciding how to play optimally is less trivial, since two
contradictory goals compete: exploration consists in trying an arm to acquire knowl-
edge about its expected reward, while exploitation consists in using the current knowledge
to decide which arm to play. How to balance the effort devoted to these two goals is the
essence of the E/E dilemma, which is especially difficult when a finite number of playing
opportunities T is imposed.
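To make this setting concrete, the following minimal Python sketch simulates a K-armed Bernoulli bandit and the oracle behaviour of a gambler who knows the reward means. The helper names (play_bandit, oracle_policy) and the interface policy(t, counts, sums) are our own illustrative assumptions, not taken from the paper.

    import numpy as np

    def play_bandit(means, T, policy, rng=None):
        """Simulate T plays of a K-armed Bernoulli bandit with given expected
        rewards; policy(t, counts, sums) chooses the arm to play at step t."""
        rng = rng if rng is not None else np.random.default_rng(0)
        K = len(means)
        counts = np.zeros(K)   # number of times each arm was played
        sums = np.zeros(K)     # sum of rewards collected by each arm
        total = 0.0
        for t in range(T):
            k = policy(t, counts, sums)
            r = float(rng.random() < means[k])   # Bernoulli reward
            counts[k] += 1
            sums[k] += r
            total += r
        return total

    def oracle_policy(means):
        """Gambler who knows the reward means: always play an arm with
        maximal expected reward."""
        best = int(np.argmax(means))
        return lambda t, counts, sums: best

For instance, play_bandit([0.2, 0.5, 0.7], T=1000, policy=oracle_policy([0.2, 0.5, 0.7])) yields a cumulative reward close to 0.7 * 1000, which is the best one can hope for in expectation.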
Most theoretical work on the multi-armed bandit problem has focused on the de-
sign of generic E/E strategies that are provably optimal in asymptotic conditions
(large T), while assuming only very unrestrictive conditions on the reward distributions
(e.g., bounded support). Among these, some strategies work by computing at every play
a quantity called an "upper confidence index" for each arm, which depends on the rewards
collected so far by this arm, and by selecting for the next play (or round of plays) the
arm with the highest index. Such E/E strategies are called index-based policies and were
initially introduced in [2], where the indices were difficult to compute. Indices that are
easier to compute were proposed later [3-5].
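As an illustration of an index-based policy, here is a sketch of the well-known UCB1 index (one of the baselines mentioned later in this section), written against the hypothetical play_bandit interface sketched above.

    import numpy as np

    def ucb1_policy(t, counts, sums):
        """Index-based policy (UCB1): play each arm once, then play the arm
        whose index -- empirical mean plus an exploration bonus -- is highest."""
        for k in range(len(counts)):
            if counts[k] == 0:       # initialisation: try every arm once
                return k
        indices = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
        return int(np.argmax(indices))

It plugs directly into the simulator, e.g. play_bandit([0.2, 0.5, 0.7], T=1000, policy=ucb1_policy).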
Index-based policies typically involve hyper-parameters whose values impact their
relative performance. Usually, when reporting simulation results, authors have manually
tuned these values on problems that share similarities with their test problems (e.g.,
the same type of distributions for generating the rewards) by running trial-and-error
simulations [4, 6]. By doing so, they have actually used prior information on the problems
to select the hyper-parameters.
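The sketch below illustrates this kind of trial-and-error tuning for a hypothetical UCB-style index with an exploration coefficient C; neither the index form nor the tuning procedure is taken from the cited works, and it reuses the play_bandit helper sketched earlier.

    import numpy as np

    def ucb_c_policy(C):
        """UCB-style index with a tunable exploration coefficient C."""
        def policy(t, counts, sums):
            for k in range(len(counts)):
                if counts[k] == 0:
                    return k
            return int(np.argmax(sums / counts + C * np.sqrt(np.log(t) / counts)))
        return policy

    def tune_by_simulation(candidate_Cs, K, T, n_sims=100, seed=1):
        """Trial-and-error tuning: average cumulative reward over simulated
        problems resembling the test problems (here, uniform Bernoulli means)."""
        rng = np.random.default_rng(seed)
        best_C, best_score = None, -np.inf
        for C in candidate_Cs:
            score = np.mean([play_bandit(rng.random(K), T, ucb_c_policy(C), rng)
                             for _ in range(n_sims)])
            if score > best_score:
                best_C, best_score = C, score
        return best_C

The choice of the distribution used to generate the simulated problems is exactly the kind of prior information alluded to above.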
Starting from these observations, we have elaborated an approach for learning, in a repro-
ducible way, good policies for playing multi-armed bandit problems over finite horizons.
This approach explicitly models and then exploits the prior information on the target set
of multi-armed bandit problems. We assume that this prior knowledge is represented as
a distribution over multi-armed bandit problems, from which we can draw any number
of training problems. Given this distribution, meta-learning consists in searching within a
chosen set of candidate E/E strategies for one that yields optimal expected performance.
This approach makes it possible to automatically tune the hyper-parameters of existing
index-based policies. More importantly, it opens the door to searching, within much broader
classes of E/E strategies, for one that is optimal for a given set of problems compliant with
the prior information. We propose two such hypothesis spaces composed of index-based
policies: in the first, the index function is a linear function of features whose
meta-learnt parameters are real numbers, while in the second it is a function gener-
ated by a grammar of symbolic formulas.
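The following sketch illustrates only the overall idea of the first hypothesis space and of the meta-learning objective; the actual features, the candidate set and the search algorithm used in the paper differ, and the helper names (linear_index_policy, meta_learn, sample_problem) are ours. It again relies on the play_bandit simulator sketched earlier.

    import numpy as np

    def linear_index_policy(theta):
        """Index-based policy whose index is a linear combination of simple
        per-arm features (empirical mean, 1/n_k, sqrt(ln t / n_k))."""
        def policy(t, counts, sums):
            for k in range(len(counts)):
                if counts[k] == 0:
                    return k
            features = np.stack([sums / counts,
                                 1.0 / counts,
                                 np.sqrt(np.log(t) / counts)])
            return int(np.argmax(theta @ features))
        return policy

    def meta_learn(sample_problem, candidate_thetas, T, n_train=200, seed=2):
        """Meta-learning as search: draw training bandit problems from the
        prior distribution and keep the parameter vector whose policy has the
        highest average cumulative reward on them."""
        rng = np.random.default_rng(seed)
        problems = [sample_problem(rng) for _ in range(n_train)]
        def expected_reward(theta):
            pol = linear_index_policy(theta)
            return np.mean([play_bandit(means, T, pol, rng) for means in problems])
        return max(candidate_thetas, key=expected_reward)

For example, sample_problem = lambda rng: rng.random(10) encodes a prior of 10 Bernoulli arms with uniformly drawn means, and candidate_thetas can be any finite sample of weight vectors over which the search is performed.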
We empirically show, in the case of Bernoulli arms, that when the number K of
arms and the playing horizon T are fully specified a priori, learning yields
policies that significantly outperform a wide range of previously proposed generic poli-
cies (UCB1, UCB1-TUNED, UCB2, UCB-V, KL-UCB and εn-GREEDY), even after
careful tuning. We also evaluate the robustness of the learned policies with respect to