By analogy with Fig. 1.1, we see that while the update of the action-value function qπ(s, a) corresponds to the analysis, the policy π(s) determines the selection of the correct action. The action-value function thus constitutes our analysis model, and updating it is what we call learning.
We first describe the implementation of the policy for an action-value function, that is, the selection of the action.
3.3 Implementing the Policy: Selecting the Actions
The basic policy for an action-value function is what we call the greedy policy, which in every state s selects the action a that maximizes qπ(s, a):

\pi(s) = \arg\max_{a \in A(s)} q_\pi(s, a)
So in every state, the action with the largest action value is selected, or if there
are several having that value, then one of them. This is the most obvious selection.
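As an illustration, the following is a minimal sketch of greedy selection over a tabular action-value function; the dictionary q keyed by (state, action) pairs, the list of actions, and the function name greedy_action are our own assumptions, not notation from the text.

import random

def greedy_action(q, state, actions):
    # Return an action with the largest action value q[(state, action)];
    # if several actions share the maximum, one of them is chosen at random.
    best_value = max(q[(state, a)] for a in actions)
    best_actions = [a for a in actions if q[(state, a)] == best_value]
    return random.choice(best_actions)

For example, with q = {('s1', 'left'): 0.2, ('s1', 'right'): 0.5}, the call greedy_action(q, 's1', ['left', 'right']) returns 'right'.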
In order, however, to avoid always restricting ourselves to exploiting existing knowledge but rather to allow new actions to be explored, in addition to the deterministic greedy policy, stochastic policies are also used. A stochastic policy π(s, a) specifies for every state s and every action a ∈ A(s) the probability of selection of a. So while in every state the deterministic policy always makes a unique selection of the action, the stochastic policy permits the selection of different actions with specified probabilities.
In the simplest case of a stochastic policy, at most steps we select the best action (greedy policy), but from time to time, that is, with probability ε, we select an action a ∈ A(s) at random. We call the resulting policy the ε-greedy policy.
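A minimal sketch of ε-greedy selection under the same assumptions; it reuses the random import and the hypothetical greedy_action helper from the previous sketch, and the default value of epsilon is chosen only for illustration.

def epsilon_greedy_action(q, state, actions, epsilon=0.1):
    # With probability epsilon, explore: pick a uniformly random action.
    # Otherwise, exploit: pick a greedy action.
    if random.random() < epsilon:
        return random.choice(list(actions))
    return greedy_action(q, state, actions)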
The combination of exploitation and exploration can also be performed on a sliding basis, that is, the frequency of selection of an action increases with its action value. This is done by means of the softmax policy. For this, the recommendations are calculated at every step in accordance with a probability distribution such as the Boltzmann distribution:
\pi(s, a) = \frac{e^{q(s, a)/\tau}}{\sum_{b \in A(s)} e^{q(s, b)/\tau}},
where τ is the "temperature parameter." A high value for τ asymptotically leads to an even distribution of all actions (exploration); a low value for τ leads to selection of the best actions (exploitation). In general, the softmax policy leads to better results than the ε-greedy policy, but both in theory and in practice, it is more difficult to handle.
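A sketch of softmax (Boltzmann) selection over the same tabular values, again assuming the q dictionary from the sketches above; the function name softmax_action and the default temperature are our own choices.

import math
import random

def softmax_action(q, state, actions, tau=1.0):
    # Sample an action with probability proportional to exp(q(s, a) / tau).
    # A high tau flattens the distribution (exploration); a low tau
    # concentrates it on the highest-valued actions (exploitation).
    actions = list(actions)
    preferences = [q[(state, a)] / tau for a in actions]
    max_pref = max(preferences)  # shift by the maximum for numerical stability
    weights = [math.exp(p - max_pref) for p in preferences]
    return random.choices(actions, weights=weights, k=1)[0]

Subtracting the maximum preference before exponentiating does not change the resulting probabilities but avoids numerical overflow when τ is small.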
The correct interplay between exploitation and exploration is one of the central
issues in RL. Here again, chess provides a useful example: in most positions, we