The allocation of the screening probabilities $p_{ij}$ aims to maximize the total expected utility within a certain time:

$$\underset{p_{ij} \in [0,1]}{\operatorname{maximize}} \; U^{\mathrm{dyn}}(p_{ij}) \quad \text{subject to} \quad \tau(p_i) \le C_{\tau} \qquad (12)$$
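As a rough illustration only, the sketch below performs the kind of constrained search that (12) describes: it scans a grid of screening probabilities, discards assignments that violate the time budget, and keeps the one with the highest expected utility. The utility and delay functions, the number of probabilities, and the budget value are all hypothetical placeholders, not the chapter's actual definitions.

```python
import itertools
import numpy as np

# Placeholder stand-ins for the chapter's U(p_ij) and tau(p_i); only their
# shapes (a utility to maximize, a delay to constrain) match problem (12).
def expected_utility(p):
    return float(np.sum(np.sqrt(p)))       # toy utility with diminishing returns

def processing_time(p):
    return float(np.sum(2.0 * p))          # toy delay, increasing in each p_ij

C_tau = 3.0                                 # assumed time budget
grid = np.linspace(0.0, 1.0, 11)            # discretize each p_ij in [0, 1]
num_probs = 3                               # assumed number of screening probabilities

best_p, best_u = None, float("-inf")
for candidate in itertools.product(grid, repeat=num_probs):
    p = np.array(candidate)
    if processing_time(p) <= C_tau:         # constraint tau(p) <= C_tau
        u = expected_utility(p)
        if u > best_u:
            best_p, best_u = p, u

print("best feasible allocation:", best_p, "expected utility:", best_u)
```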
4.4.2 Safe Experimentation
We will benchmark our results against the Safe Experimentation algorithm described in [13].
This low-complexity, model-free learning approach was first proposed for large-
scale, distributed, multi-agent systems, where each agent is unable to observe the
actions of all other agents (due to informational or complexity constraints) and
hence cannot build a model of other agents [22]. The agent therefore adheres to a
“trusted” action at most times, but occasionally “explores” a different one in search
of a potentially better action.
Safe Experimentation is a reinforcement learning algorithm where each classifier
learns by observing the payoff with which its past actions were rewarded. As such,
it does not consider the interactions between agents, or in our case, the actions of
other autonomous classifiers. In particular, no explicit message exchange among classifiers is required (i.e. no vw is exchanged), though each classifier needs to know the reward of its own action (and computing this reward may still require some form of implicit communication between agents).
The Safe Experimentation algorithm is initialized by selecting an initial order $\sigma_0$ of classifiers. $C_{\sigma_0(h+1)}$ will be referred to as the “trusted” child of the $h$-th classifier $C_{\sigma_0(h)}$. At each time slot, $C_{\sigma(h)}$ will either forward the stream to its “trusted” child $C_{\sigma(h+1)}$, with probability $(1-\varepsilon)$, or, with probability $\varepsilon$, explore a classifier $C_j$ chosen randomly among its children. In the case where a higher reward is achieved through exploration, $C_j$ will become the new “trusted” child of $C_{\sigma(h)}$.
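A minimal sketch of this per-classifier rule, assuming each classifier only keeps track of its current “trusted” child and the best reward observed so far (class and method names are illustrative, not the chapter's implementation):

```python
import random

class SafeExpClassifier:
    """One classifier C_sigma(h) deciding which child receives the stream."""

    def __init__(self, children, epsilon):
        self.children = list(children)                 # candidate children C_j
        self.epsilon = epsilon                         # exploration probability
        self.trusted = random.choice(self.children)    # initial trusted child (from sigma_0)
        self.best_reward = float("-inf")

    def choose_child(self):
        # Forward to the trusted child with probability (1 - epsilon),
        # otherwise explore a child chosen uniformly at random.
        if random.random() < self.epsilon:
            return random.choice(self.children)
        return self.trusted

    def update(self, child, reward):
        # If a higher reward was achieved (e.g. through exploration),
        # the corresponding child becomes the new trusted child.
        if reward > self.best_reward:
            self.trusted, self.best_reward = child, reward
```

At every time slot the classifier would call choose_child(), forward the stream, observe the resulting reward, and then call update().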
The name Safe Experimentation comes from the fact that the global utility achieved by the algorithm is non-decreasing over time. Since utilities are upper-bounded, this ensures convergence of the Safe Experimentation algorithm. Furthermore, so long as $\varepsilon > 0$, all possible orders will ultimately be explored, such that Safe Experimentation converges to the optimal order [13].
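The following toy simulation illustrates both properties at the level of complete orders: the utility attached to each order is a random placeholder, the trusted order is only replaced when a sampled order does strictly better (so its utility never decreases), and with a positive exploration rate every order is eventually sampled.

```python
import itertools
import random

classifiers = [0, 1, 2, 3]                          # toy system with N = 4 classifiers
orders = list(itertools.permutations(classifiers))
utility = {o: random.random() for o in orders}      # placeholder utility per order

epsilon = 0.2
trusted = random.choice(orders)                     # initial trusted order sigma_0
trusted_utility = utility[trusted]

trace = []
for t in range(2000):
    order = random.choice(orders) if random.random() < epsilon else trusted
    if utility[order] > trusted_utility:            # better order found while exploring
        trusted, trusted_utility = order, utility[order]
    trace.append(trusted_utility)

# The trusted order's utility is non-decreasing, and with high probability it
# has reached the best of all N! orders after enough exploration steps.
assert trace == sorted(trace)
print(trusted_utility, max(utility.values()))
```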
Instead of considering a fixed exploration rate $\varepsilon$, we can consider a diminishing exploration rate $\varepsilon_t$. In this way, the algorithm will explore largely during the first iterations and focus on exploitation near convergence.
$\varepsilon_t \rightarrow 0$ and $\sum_{t=1}^{\infty} \varepsilon_t^{N-1} \rightarrow \infty$ are sufficient conditions for convergence, typically verified for $\varepsilon_t = (1/t)^{1/(N-1)}$.
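As a quick numerical check, with the schedule $\varepsilon_t = (1/t)^{1/(N-1)}$ the exploration rate vanishes while $\varepsilon_t^{N-1} = 1/t$, whose partial sums (the harmonic series) grow without bound; the value of N below is just an example.

```python
N = 5                                        # example number of classifiers

def eps(t):
    return (1.0 / t) ** (1.0 / (N - 1))      # diminishing exploration rate

partial_sum = 0.0
for t in range(1, 100_001):
    partial_sum += eps(t) ** (N - 1)         # each term equals 1/t

print(eps(100_000))   # ~0.056: exploration rate tends to 0
print(partial_sum)    # ~12.1: grows like ln(t), i.e. diverges
```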
Two major limits of Safe Experimentation can be identified:
• Slow Convergence: One iteration of Safe Experimentation is very rapid ($O(N)$), since only one order is experienced. However, the expected number of iterations required to converge to the optimal order is bounded below by $N!$ (corresponding to
 