The allocation of the screening probabilities $p_{ij}$ aims to maximize the total expected utility within a certain time:

$$\underset{p_{ij} \in [0,1]}{\operatorname{maximize}} \; U^{\mathrm{dyn}}(p_{ij}) \quad \text{subject to} \quad \tau(p_i) \le C_{\tau} \qquad (12)$$
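As a rough illustration only, the sketch below performs the kind of constrained search that (12) describes: it scans a grid of screening probabilities, discards assignments that violate the time budget, and keeps the one with the highest expected utility. The utility and delay functions, the number of probabilities, and the budget value are all hypothetical placeholders, not the chapter's actual definitions.

```python
import itertools
import numpy as np

# Placeholder stand-ins for the chapter's U(p_ij) and tau(p_i); only their
# shapes (a utility to maximize, a delay to constrain) match problem (12).
def expected_utility(p):
    return float(np.sum(np.sqrt(p)))       # toy utility with diminishing returns

def processing_time(p):
    return float(np.sum(2.0 * p))          # toy delay, increasing in each p_ij

C_tau = 3.0                                 # assumed time budget
grid = np.linspace(0.0, 1.0, 11)            # discretize each p_ij in [0, 1]
num_probs = 3                               # assumed number of screening probabilities

best_p, best_u = None, float("-inf")
for candidate in itertools.product(grid, repeat=num_probs):
    p = np.array(candidate)
    if processing_time(p) <= C_tau:         # constraint tau(p) <= C_tau
        u = expected_utility(p)
        if u > best_u:
            best_p, best_u = p, u

print("best feasible allocation:", best_p, "expected utility:", best_u)
```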
4.4.2 Safe Experimentation
We will benchmark our results against the Safe Experimentation algorithm described in [13].
This low-complexity, model-free learning approach was first proposed for large-
scale, distributed, multi-agent systems, where each agent is unable to observe the
actions of all other agents (due to informational or complexity constraints) and
hence cannot build a model of other agents [22]. The agent therefore adheres to a
“trusted” action at most times, but occasionally “explores” a different one in search
of a potentially better action.
Safe Experimentation is a reinforcement learning algorithm where each classifier
learns by observing the payoff with which its past actions were rewarded. As such,
it does not consider the interactions between agents, or in our case, the actions of
other autonomous classifiers. In particular, no explicit message exchange among classifiers is required (i.e. no vw is exchanged), though each classifier needs to know the reward of its own action (and computing this reward may still require some form of implicit communication between agents).
The Safe Experimentation algorithm is initialized by selecting an initial order $\sigma_0$ of classifiers. $C_{\sigma_0(h+1)}$ will be referred to as the “trusted” child of the $h$-th classifier $C_{\sigma_0(h)}$. At each time slot, $C_{\sigma(h)}$ will either forward the stream to its “trusted” child $C_{\sigma(h+1)}$, with probability $(1-\varepsilon)$, or, with probability $\varepsilon$, explore a classifier $C_j$ chosen randomly among its children. In the case where a higher reward is achieved through exploration, $C_j$ will become the new “trusted” child of $C_{\sigma(h)}$.
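A minimal sketch of this per-classifier rule, assuming each classifier only keeps track of its current “trusted” child and the best reward observed so far (class and method names are illustrative, not the chapter's implementation):

```python
import random

class SafeExpClassifier:
    """One classifier C_sigma(h) deciding which child receives the stream."""

    def __init__(self, children, epsilon):
        self.children = list(children)                 # candidate children C_j
        self.epsilon = epsilon                         # exploration probability
        self.trusted = random.choice(self.children)    # initial trusted child (from sigma_0)
        self.best_reward = float("-inf")

    def choose_child(self):
        # Forward to the trusted child with probability (1 - epsilon),
        # otherwise explore a child chosen uniformly at random.
        if random.random() < self.epsilon:
            return random.choice(self.children)
        return self.trusted

    def update(self, child, reward):
        # If a higher reward was achieved (e.g. through exploration),
        # the corresponding child becomes the new trusted child.
        if reward > self.best_reward:
            self.trusted, self.best_reward = child, reward
```

At every time slot the classifier would call choose_child(), forward the stream, observe the resulting reward, and then call update().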
The name Safe Experimentation comes from the fact that the global utility achieved by the algorithm is non-decreasing over time. Since utilities are upper-bounded, this ensures convergence of the Safe Experimentation algorithm. Furthermore, so long as $\varepsilon > 0$, all possible orders will ultimately be explored, such that Safe Experimentation converges to the optimal order [13].
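The following toy simulation illustrates both properties at the level of complete orders: the utility attached to each order is a random placeholder, the trusted order is only replaced when a sampled order does strictly better (so its utility never decreases), and with a positive exploration rate every order is eventually sampled.

```python
import itertools
import random

classifiers = [0, 1, 2, 3]                          # toy system with N = 4 classifiers
orders = list(itertools.permutations(classifiers))
utility = {o: random.random() for o in orders}      # placeholder utility per order

epsilon = 0.2
trusted = random.choice(orders)                     # initial trusted order sigma_0
trusted_utility = utility[trusted]

trace = []
for t in range(2000):
    order = random.choice(orders) if random.random() < epsilon else trusted
    if utility[order] > trusted_utility:            # better order found while exploring
        trusted, trusted_utility = order, utility[order]
    trace.append(trusted_utility)

# The trusted order's utility is non-decreasing, and with high probability it
# has reached the best of all N! orders after enough exploration steps.
assert trace == sorted(trace)
print(trusted_utility, max(utility.values()))
```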
Instead of considering a fixed exploration rate $\varepsilon$, we can consider a diminishing exploration rate $\varepsilon_t$. In this way, the algorithm will explore largely during the first iterations and focus on exploitation near convergence.
$\varepsilon_t \rightarrow 0$ and $\sum_{t=1}^{\infty} \varepsilon_t^{N-1} \rightarrow \infty$ are sufficient conditions for convergence, typically verified for $\varepsilon_t = (1/t)^{1/(N-1)}$.
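As a quick numerical check, with the schedule $\varepsilon_t = (1/t)^{1/(N-1)}$ the exploration rate vanishes while $\varepsilon_t^{N-1} = 1/t$, whose partial sums (the harmonic series) grow without bound; the value of N below is just an example.

```python
N = 5                                        # example number of classifiers

def eps(t):
    return (1.0 / t) ** (1.0 / (N - 1))      # diminishing exploration rate

partial_sum = 0.0
for t in range(1, 100_001):
    partial_sum += eps(t) ** (N - 1)         # each term equals 1/t

print(eps(100_000))   # ~0.056: exploration rate tends to 0
print(partial_sum)    # ~12.1: grows like ln(t), i.e. diverges
```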
Two major limits of Safe Experimentation can be identified:
• Slow Convergence: One iteration of Safe Experimentation is very rapid ($O(N)$), since only one order is experienced. However, the expected number of iterations required to converge to the optimal order is bounded below by $N!$ (corresponding to
 