the fact that with no acceleration, we still remain in the same state or, in rare cases, even pass into the next-highest state. The result is therefore plausible.
3.6 Determining an Optimal Solution
The question remains how to determine an optimal solution to the Bellman equation, since we know neither its action-value function $q_\pi(s, a)$ nor its policy $\pi(s, a)$. A solution is provided by the policy iteration method from dynamic programming, which, in a generalized form, serves as a central tool in RL and for REs in general.
Policy iteration is based on the following approach: starting with an arbitrary initial policy $\pi_0$, in every step $i = 0, \ldots, n$ the action-value function $q_i$ corresponding to the current policy $\pi_i$ is computed by solving the Bellman equation (3.6) (the solution method will be described in Sect. 3.9.4). After this, we determine a greedy policy corresponding to the action-value function $q_i$, that is,
$$\pi_{i+1}(s) = \arg\max_{a \in A(s)} q_i(s, a).$$
In plain English, $\pi_{i+1}$ is taken to be a policy which in every state $s$ selects one of the actions $a$ so as to maximize $q_i(s, a)$. For $\pi_{i+1}$, the action-value function $q_{i+1}$
is then calculated in turn, and so on. This then yields a sequence of policies and
action-value functions:
$$\pi_0 \rightarrow q_0 \rightarrow \pi_1 \rightarrow q_1 \rightarrow \pi_2 \rightarrow q_2 \rightarrow \ldots$$
It can be shown that after a finite number of iterations, this process terminates with the optimal policy $\pi^*$ and corresponding action-value function $q^*$, which satisfy
$$q^*(s, a) = \max_\pi q_\pi(s, a), \quad \forall s \in S, \; \forall a \in A(s).$$
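To make the procedure concrete, the following is a minimal sketch of policy iteration for a finite MDP whose transition probabilities and rewards are known. The function name policy_iteration, the array layout P[s, a, s'] and R[s, a, s'], and the use of a direct linear solve for the evaluation step are illustrative assumptions; they are not the book's notation, and the actual solution method for the Bellman equation is the subject of Sect. 3.9.4.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration sketch for a finite MDP.
    P[s, a, s2] = transition probability, R[s, a, s2] = reward of that transition.
    Returns a deterministic policy and its action-value function q."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: solve the linear Bellman system (I - gamma * P_pi) v = r_pi
        P_pi = P[np.arange(n_states), policy]        # (S, S) transitions under pi
        r_pi = np.einsum("ij,ij->i", P_pi, R[np.arange(n_states), policy])
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Action values q_i(s, a) = sum_s' P(s'|s,a) * (r(s,a,s') + gamma * v(s'))
        q = np.einsum("saj,saj->sa", P, R + gamma * v[None, None, :])
        # Policy improvement: greedy policy pi_{i+1}(s) = argmax_a q_i(s, a)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):       # greedy step no longer changes pi
            return policy, q
        policy = new_policy
```

The loop stops as soon as the greedy improvement no longer changes the policy, mirroring the finite termination property stated above.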
Example 3.6 For our example of the car with $\gamma = 0.5$ as above, policy iteration yields, not surprisingly, the optimal policy $\pi^*$, which stipulates that in each of the three states the action $a_1$ be performed, that is, to always accelerate. The associated action values are
$$q^*(s_1, a_1) = 182, \quad q^*(s_2, a_1) = 194, \quad q^*(s_3, a_1) = 198,$$
and are thus all greater than those of the non-acceleration policy considered in the
last example, which is incidentally the least successful among all policies.
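As a purely illustrative usage of the sketch above, one can run it on a small three-state, two-action MDP in the spirit of the car example. The transition probabilities and rewards below are invented placeholders rather than the book's actual values, so the resulting action values will not reproduce 182, 194, and 198; the qualitative outcome, that accelerating is chosen in every state, is what the snippet is meant to show.

```python
import numpy as np

# Invented 3-state, 2-action MDP loosely in the spirit of the car example;
# these probabilities and rewards are placeholders, not the book's values.
P = np.zeros((3, 2, 3))
P[0, 0] = [0.1, 0.9, 0.0]    # a1 = accelerate: tend to move up one state
P[1, 0] = [0.0, 0.1, 0.9]
P[2, 0] = [0.0, 0.05, 0.95]
P[0, 1] = [0.9, 0.1, 0.0]    # a2 = do not accelerate: tend to stay
P[1, 1] = [0.1, 0.8, 0.1]
P[2, 1] = [0.0, 0.2, 0.8]

R = np.zeros((3, 2, 3))
R[:, :, 0], R[:, :, 1], R[:, :, 2] = 20, 60, 100   # reward depends on the state entered

policy, q = policy_iteration(P, R, gamma=0.5)
print(policy)                    # with these numbers: action a1 (index 0) in every state
print(q[np.arange(3), policy])   # the corresponding action values
```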
What will happen if we decrease the reward of the third state? We might, for instance, be pulled over by the police for exceeding a speed limit. The result for different $r_{ss_3}$ values is shown in Fig. 3.5, where (a) shows the case $r_{ss_3} = 100$ under consideration. The lower the value now assigned to this reward, the more unattractive