Instead, we propose a simple approximated solution, which consists in discriminating formulas by comparing how they rank (in terms of values returned by the formula) a set of d random samples of the variables r_k, σ_k, t_k, t. More formally, the procedure is the following (a short code sketch is given below):
1. we first build Θ, the space of all formulas f such that length(f) ≤ L;
2. for i = 1, ..., d, we uniformly draw (within their respective domains) some random realizations of the variables r_k, σ_k, t_k, t that we concatenate into a vector Θ_i;
3. we cluster all formulas from Θ according to the following rule: two formulas F and F′ belong to the same cluster if and only if they rank all the Θ_i points in the same order, i.e.:

   ∀ i, j ∈ {1, ..., d}, i ≠ j,  F(Θ_i) ≥ F(Θ_j) ⇐⇒ F′(Θ_i) ≥ F′(Θ_j).

   Formulas leading to invalid index functions (caused for instance by division by zero or logarithm of negative values) are discarded;
4. among each cluster, we select one formula of minimal length;
5. we gather all the selected minimal length formulas into an approximated reduced set of formulas Θ̃.
In the following, we denote by M the cardinality of the approximate set of formulas Θ̃ = {f_1, ..., f_M}.
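The following is a minimal Python sketch of this reduction, not the authors' implementation: the enumerator enumerate_formulas(L), formulas that are callable on a sample vector and expose a length attribute, and per-variable (low, high) domains are all interfaces invented here for illustration.

```python
import random
from collections import defaultdict

def reduce_formula_space(enumerate_formulas, L, d, domains, seed=0):
    """Cluster formulas by the ranking they induce on d random samples and
    keep one minimal-length representative per cluster (steps 1-5 above)."""
    rng = random.Random(seed)
    # Step 1: build Θ, the space of all formulas f with length(f) <= L.
    formulas = list(enumerate_formulas(L))
    # Step 2: draw d random realizations Θ_1, ..., Θ_d of (r_k, σ_k, t_k, t),
    # each variable sampled uniformly within its own domain.
    samples = [tuple(rng.uniform(lo, hi) for lo, hi in domains)
               for _ in range(d)]
    clusters = defaultdict(list)
    for f in formulas:
        try:
            values = [f(s) for s in samples]
        except (ZeroDivisionError, ValueError):
            continue  # step 3: discard invalid index functions
        # Step 3: the tuple of ranks (number of strictly smaller values)
        # encodes how f orders the d samples; equal tuples <=> same ranking.
        signature = tuple(sum(v < values[i] for v in values)
                          for i in range(len(values)))
        clusters[signature].append(f)
    # Steps 4-5: gather one formula of minimal length from each cluster.
    return [min(cluster, key=lambda g: g.length)
            for cluster in clusters.values()]
```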
Optimization Algorithm. A naive approach for finding the best formula f∗ ∈ Θ̃ would be to evaluate Δ(f) for each formula f ∈ Θ̃ and simply return the best one. While extremely simple to implement, such an approach could reveal itself to be time-inefficient in the case of spaces Θ̃ of large cardinality.
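In code, this naive baseline is just an exhaustive scan; the names below, and the assumption that Δ is a loss to be minimized, are illustrative only:

```python
def naive_best_formula(formulas, delta):
    """Evaluate Δ(f) for every formula f in Θ̃ and return the best one.
    Requires one full evaluation of Δ per formula, hence M evaluations."""
    # Assumes Δ is a loss to be minimized; use max() instead if it is a score.
    return min(formulas, key=delta)
```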
Preliminary experiments have shown us that Θ̃ contains a majority of formulas leading to relatively badly performing index-based policies. It turns out that relatively few samples of R_{P^(i),T} are sufficient to reject these badly performing formulas with high confidence. To exploit this observation, it is natural to formalize the search for the best formula as another multi-armed bandit problem. To each formula f_k ∈ Θ̃, we associate an arm. Pulling the arm k consists in selecting a training problem P^(i) and in running one episode with the index-based policy whose index formula is f_k. This leads to a reward associated to arm k whose value is the quantity −R_{P^(i),T} observed during the episode. The purpose of multi-armed bandit algorithms is here to process the sequence of observed rewards so as to select the next formula to try in a smart way, such that when the budget of pulls has been exhausted, one (or several) high-quality formula(s) can be identified.
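As an illustrative sketch of this formalization (the class name FormulaArm, the run_episode callable, and its signature are stand-ins invented here, not the authors' code):

```python
class FormulaArm:
    """One arm per candidate formula f_k of the reduced set Θ̃: pulling the
    arm runs one episode of the index-based policy whose index formula is
    f_k and returns the observed reward -R_{P^(i),T}."""

    def __init__(self, formula, run_episode, horizon):
        self.formula = formula          # f_k
        self.run_episode = run_episode  # assumed: (formula, problem, T) -> R_{P,T}
        self.horizon = horizon          # T

    def pull(self, problem):
        # Reward of the formula-level bandit: the negated quantity
        # R_{P^(i),T} observed during one episode on the training problem.
        return -self.run_episode(self.formula, problem, self.horizon)
```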
In the formalization of Equation 7 as a multi-armed bandit problem, only the quality
of the finally suggested arm matters. How to select arms so as to identify the best one
in a finite amount of time is known as the
pure exploration
multi-armed bandit problem
[11]. It has been shown that index-based policies based on upper confidence bounds are good policies for solving pure exploration bandit problems. Our optimization procedure works as follows: we use a bandit algorithm such as UCB1-TUNED during a given number of steps and then return the policy that corresponds to the formula f_k with the highest expected reward r_k. The problem instances are selected depending on the number of times the arm has been played so far: at each step, we select the training problem P^(i) with i = 1 + (t_k mod N).
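Assembling the pieces, here is a hedged sketch of the full optimization loop, assuming the standard UCB1-TUNED index and a hypothetical run_episode that returns the quantity R_{P^(i),T} of one episode; none of these names come from the text.

```python
import math

def find_best_formula(formulas, problems, run_episode, budget, horizon):
    """Pure-exploration search over the M candidate formulas with UCB1-TUNED.
    run_episode(f, P, T) is assumed to return the quantity R_{P,T} of one
    episode of the index-based policy whose index formula is f."""
    M, N = len(formulas), len(problems)
    counts = [0] * M      # t_k: number of times arm k has been played
    sums = [0.0] * M      # cumulative rewards per arm
    sq_sums = [0.0] * M   # cumulative squared rewards (for the variance term)

    def ucb1_tuned(k, n):
        # Standard UCB1-TUNED index: mean + sqrt((ln n / t_k) * min(1/4, V_k)).
        mean = sums[k] / counts[k]
        var = max(0.0, sq_sums[k] / counts[k] - mean ** 2)
        v_k = var + math.sqrt(2.0 * math.log(n) / counts[k])
        return mean + math.sqrt(math.log(n) / counts[k] * min(0.25, v_k))

    for step in range(budget):
        if step < M:
            k = step  # play every arm once before trusting the index
        else:
            n = sum(counts)  # total number of pulls so far
            k = max(range(M), key=lambda a: ucb1_tuned(a, n))
        i = 1 + (counts[k] % N)  # training problem P^(i), i = 1 + (t_k mod N)
        reward = -run_episode(formulas[k], problems[i - 1], horizon)
        counts[k] += 1
        sums[k] += reward
        sq_sums[k] += reward ** 2

    # Return the formula whose arm has the highest empirical mean reward.
    best = max(range(M), key=lambda a: sums[a] / max(1, counts[a]))
    return formulas[best]
```

The deterministic cycling i = 1 + (t_k mod N) makes each arm's reward estimate average evenly over the N training problems as t_k grows, rather than depending on a random draw of problems.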