By analogy with Fig. 1.1, we see that while the update of the action-value function qπ(s, a) corresponds to the analysis, the policy π(s) determines the selection of the correct action. The action-value function thus constitutes our analysis model, and updating it is what we call learning.
We first describe the implementation of the policy for an action-value function, that is, the selection of the action.
3.3 Implementing the Policy: Selecting the Actions
The basic policy for an action-value function is what we call the greedy policy, which in every state s selects the action a that maximizes qπ(s, a):

\pi(s) = \arg\max_{a \in A(s)} q_\pi(s, a)
So in every state, the action with the largest action value is selected, or if there
are several having that value, then one of them. This is the most obvious selection.
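As an illustration, the following is a minimal sketch of greedy selection over a tabular action-value function; the dictionary q keyed by (state, action) pairs, the list of actions, and the function name greedy_action are our own assumptions, not notation from the text.

import random

def greedy_action(q, state, actions):
    # Return an action with the largest action value q[(state, action)];
    # if several actions share the maximum, one of them is chosen at random.
    best_value = max(q[(state, a)] for a in actions)
    best_actions = [a for a in actions if q[(state, a)] == best_value]
    return random.choice(best_actions)

For example, with q = {('s1', 'left'): 0.2, ('s1', 'right'): 0.5}, the call greedy_action(q, 's1', ['left', 'right']) returns 'right'.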
In order, however, to avoid always restricting ourselves to exploiting existing knowledge but rather to allow new actions to be explored, in addition to the deterministic greedy policy, stochastic policies are also used. A stochastic policy π(s, a) specifies for every state s and every action a ∈ A(s) the probability of selection of a. So while in every state the deterministic policy always makes a unique selection of the action, the stochastic policy permits the selection of different actions with specified probabilities.
In the simplest case of a stochastic policy, at most steps we select the best action (greedy policy), but from time to time, that is, with probability ε, we select an action a ∈ A(s) at random. We call the resulting policy the ε-greedy policy.
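A minimal sketch of ε-greedy selection under the same assumptions; it reuses the random import and the hypothetical greedy_action helper from the previous sketch, and the default value of epsilon is chosen only for illustration.

def epsilon_greedy_action(q, state, actions, epsilon=0.1):
    # With probability epsilon, explore: pick a uniformly random action.
    # Otherwise, exploit: pick a greedy action.
    if random.random() < epsilon:
        return random.choice(list(actions))
    return greedy_action(q, state, actions)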
The combination of exploitation and exploration can also be performed on a sliding basis, that is, the frequency of selection of an action increases with its action value. This is done by means of the softmax policy. For this, the recommendations are calculated at every step in accordance with a probability distribution such as the Boltzmann distribution:
\pi(s, a) = \frac{e^{q(s, a)/\tau}}{\sum_{b \in A(s)} e^{q(s, b)/\tau}},
where τ is the "temperature parameter." A high value for τ asymptotically leads to an even distribution of all actions (exploration); a low value for τ leads to selection of the best actions (exploitation). In general, the softmax policy leads to better results than the ε-greedy policy, but both in theory and in practice, it is more difficult to handle.
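A sketch of softmax (Boltzmann) selection over the same tabular values, again assuming the q dictionary from the sketches above; the function name softmax_action and the default temperature are our own choices.

import math
import random

def softmax_action(q, state, actions, tau=1.0):
    # Sample an action with probability proportional to exp(q(s, a) / tau).
    # A high tau flattens the distribution (exploration); a low tau
    # concentrates it on the highest-valued actions (exploitation).
    actions = list(actions)
    preferences = [q[(state, a)] / tau for a in actions]
    max_pref = max(preferences)  # shift by the maximum for numerical stability
    weights = [math.exp(p - max_pref) for p in preferences]
    return random.choices(actions, weights=weights, k=1)[0]

Subtracting the maximum preference before exponentiating does not change the resulting probabilities but avoids numerical overflow when τ is small.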
The correct interplay between exploitation and exploration is one of the central
issues in RL. Here again, chess provides a useful example: in most positions, we