Instead, we propose a simple approximated solution, which consists in discriminating formulas by comparing how they rank (in terms of values returned by the formula) a set of d random samples of the variables r_k, σ_k, t_k, t. More formally, the procedure is the following (a short code sketch is given below):
1. we first build Θ, the space of all formulas f such that length(f) ≤ L;
2. for i = 1, ..., d, we uniformly draw (within their respective domains) some random realizations of the variables r_k, σ_k, t_k, t that we concatenate into a vector Θ_i;
3. we cluster all formulas from Θ according to the following rule: two formulas F and F′ belong to the same cluster if and only if they rank all the Θ_i points in the same order, i.e.:

   ∀ i, j ∈ {1, ..., d}, i ≠ j,  F(Θ_i) ≥ F(Θ_j) ⇐⇒ F′(Θ_i) ≥ F′(Θ_j).

   Formulas leading to invalid index functions (caused for instance by division by zero or logarithm of negative values) are discarded;
4. among each cluster, we select one formula of minimal length;
5. we gather all the selected minimal length formulas into an approximated reduced set of formulas Θ̃.
In the following, we denote by M the cardinality of the approximate set of formulas Θ̃ = {f_1, ..., f_M}.
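The following is a minimal Python sketch of this reduction, not the authors' implementation: the enumerator enumerate_formulas(L), formulas that are callable on a sample vector and expose a length attribute, and per-variable (low, high) domains are all interfaces invented here for illustration.

```python
import random
from collections import defaultdict

def reduce_formula_space(enumerate_formulas, L, d, domains, seed=0):
    """Cluster formulas by the ranking they induce on d random samples and
    keep one minimal-length representative per cluster (steps 1-5 above)."""
    rng = random.Random(seed)
    # Step 1: build Θ, the space of all formulas f with length(f) <= L.
    formulas = list(enumerate_formulas(L))
    # Step 2: draw d random realizations Θ_1, ..., Θ_d of (r_k, σ_k, t_k, t),
    # each variable sampled uniformly within its own domain.
    samples = [tuple(rng.uniform(lo, hi) for lo, hi in domains)
               for _ in range(d)]
    clusters = defaultdict(list)
    for f in formulas:
        try:
            values = [f(s) for s in samples]
        except (ZeroDivisionError, ValueError):
            continue  # step 3: discard invalid index functions
        # Step 3: the tuple of ranks (number of strictly smaller values)
        # encodes how f orders the d samples; equal tuples <=> same ranking.
        signature = tuple(sum(v < values[i] for v in values)
                          for i in range(len(values)))
        clusters[signature].append(f)
    # Steps 4-5: gather one formula of minimal length from each cluster.
    return [min(cluster, key=lambda g: g.length)
            for cluster in clusters.values()]
```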
Optimization Algorithm. A naive approach for finding the best formula f∗ ∈ Θ̃ would be to evaluate Δ(f) for each formula f ∈ Θ̃ and simply return the best one. While extremely simple to implement, such an approach could reveal itself to be time-inefficient in the case of spaces Θ̃ of large cardinality.
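In code, this naive baseline is just an exhaustive scan; the names below, and the assumption that Δ is a loss to be minimized, are illustrative only:

```python
def naive_best_formula(formulas, delta):
    """Evaluate Δ(f) for every formula f in Θ̃ and return the best one.
    Requires one full evaluation of Δ per formula, hence M evaluations."""
    # Assumes Δ is a loss to be minimized; use max() instead if it is a score.
    return min(formulas, key=delta)
```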
Preliminary experiments have shown us that Θ̃ contains a majority of formulas leading to relatively badly performing index-based policies. It turns out that relatively few samples of R_{P^(i),T} are sufficient to reject these badly performing formulas with high confidence. To exploit this observation, it is natural to formalize the search for the best formula as another multi-armed bandit problem. To each formula f_k ∈ Θ̃, we associate an arm. Pulling the arm k consists in selecting a training problem P^(i) and in running one episode with the index-based policy whose index formula is f_k. This leads to a reward associated to arm k whose value is the quantity −R_{P^(i),T} observed during the episode. The purpose of multi-armed bandit algorithms is here to process the sequence of observed rewards so as to select the next formula to try in a smart way, such that when the budget of pulls has been exhausted, one (or several) high-quality formula(s) can be identified.
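As an illustrative sketch of this formalization (the class name FormulaArm, the run_episode callable, and its signature are stand-ins invented here, not the authors' code):

```python
class FormulaArm:
    """One arm per candidate formula f_k of the reduced set Θ̃: pulling the
    arm runs one episode of the index-based policy whose index formula is
    f_k and returns the observed reward -R_{P^(i),T}."""

    def __init__(self, formula, run_episode, horizon):
        self.formula = formula          # f_k
        self.run_episode = run_episode  # assumed: (formula, problem, T) -> R_{P,T}
        self.horizon = horizon          # T

    def pull(self, problem):
        # Reward of the formula-level bandit: the negated quantity
        # R_{P^(i),T} observed during one episode on the training problem.
        return -self.run_episode(self.formula, problem, self.horizon)
```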
In the formalization of Equation 7 as a multi-armed bandit problem, only the quality
of the finally suggested arm matters. How to select arms so as to identify the best one
in a finite amount of time is known as the
pure exploration
multi-armed bandit problem
[11]. It has been shown that index-based policies based on upper confidence bounds are good policies for solving pure exploration bandit problems. Our optimization procedure works as follows: we use a bandit algorithm such as UCB1-TUNED during a given number of steps and then return the policy that corresponds to the formula f_k with the highest expected reward r_k. The problem instances are selected depending on the number of times the arm has been played so far: at each step, we select the training problem P^(i) with i = 1 + (t_k mod N).
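Assembling the pieces, here is a hedged sketch of the full optimization loop, assuming the standard UCB1-TUNED index and a hypothetical run_episode that returns the quantity R_{P^(i),T} of one episode; none of these names come from the text.

```python
import math

def find_best_formula(formulas, problems, run_episode, budget, horizon):
    """Pure-exploration search over the M candidate formulas with UCB1-TUNED.
    run_episode(f, P, T) is assumed to return the quantity R_{P,T} of one
    episode of the index-based policy whose index formula is f."""
    M, N = len(formulas), len(problems)
    counts = [0] * M      # t_k: number of times arm k has been played
    sums = [0.0] * M      # cumulative rewards per arm
    sq_sums = [0.0] * M   # cumulative squared rewards (for the variance term)

    def ucb1_tuned(k, n):
        # Standard UCB1-TUNED index: mean + sqrt((ln n / t_k) * min(1/4, V_k)).
        mean = sums[k] / counts[k]
        var = max(0.0, sq_sums[k] / counts[k] - mean ** 2)
        v_k = var + math.sqrt(2.0 * math.log(n) / counts[k])
        return mean + math.sqrt(math.log(n) / counts[k] * min(0.25, v_k))

    for step in range(budget):
        if step < M:
            k = step  # play every arm once before trusting the index
        else:
            n = sum(counts)  # total number of pulls so far
            k = max(range(M), key=lambda a: ucb1_tuned(a, n))
        i = 1 + (counts[k] % N)  # training problem P^(i), i = 1 + (t_k mod N)
        reward = -run_episode(formulas[k], problems[i - 1], horizon)
        counts[k] += 1
        sums[k] += reward
        sq_sums[k] += reward ** 2

    # Return the formula whose arm has the highest empirical mean reward.
    best = max(range(M), key=lambda a: sums[a] / max(1, counts[a]))
    return formulas[best]
```

The deterministic cycling i = 1 + (t_k mod N) makes each arm's reward estimate average evenly over the N training problems as t_k grows, rather than depending on a random draw of problems.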