Information Technology Reference
In-Depth Information
˄
s
˄
1/2
1/2
s 1
Fig. 6.3 Max-seeking policies
derivative policies rather than extreme policies makes the details considerably differ-
ent. Consequently, the proof is spelled out in some detail in Sect. 6.5.2 , cumulating
in Theorem 6.10 .
Example 6.10 Let us say that a derivative policy dp is max-seeking with respect to
a reward function r if for all s
S the following requirements are met.
˄
−ₒ ʔ 1 .
r
1. If dp ( s )
then r ( s )
≥ P
max ( ʔ 1 ) for all s
2. If dp ( s )
= ʔ then
r
a)
P
max ( ʔ )
r ( s ) and
˄
−ₒ
r
r
ʔ 1 .
What a max-seeking policy does is to evaluate
b)
P
max ( ʔ )
≥ P
max ( ʔ 1 ) for all s
r
max in advance, for a given reward
function r , and then label each state s with the payoff value
P
r max ( s ). The policy
at any state s is then to compare r ( s ) with the expected label values
P
r max ( ʔ ) (i.e.
P
˄
−ₒ
ʔ and then to select the greatest
among all those values. Note that for the policy to be well defined, we require that
the pLTS under consideration is finitely branching.
In case that seems obvious, we now consider the pLTS in Fig. 6.3 and let us
apply the above definition of max-seeking policies to the reward function given by
r ( s 0 )
r max )) for each outgoing transition s
Exp ʔ (
P
=
0,
r ( s 1 )
=
1. For both states a payoff of 1 i s attainable ev entually, thus
r max ( s 0 )
r max ( s 1 )
P
= P
=
1, because we have s 0
s 1 and s 1
s 1 . Hence, both
r max -labelled with 1. At state s 0 the policy then makes a choice among
three options: (1) to stay unmoved, yielding immediate payoff r ( s 0 )
states will be
P
=
0; (2) to take
˄
−ₒ
˄
−ₒ
the transition s 0
s 1 . Clearly one of
the latter two is chosen — but which? If it is the second, then indeed the maximum
payoff 1 can be achieved. If it is the first, then in fact the overall payoff will be 0
because of divergence, so the policy would fail to attain the maximum payoff 1.
However, for properly discounted max-seeking policies, we show in Proposi-
tion 6.6 that they always attain the maximum payoffs.
With Theorem 6.8 at hand, we are in the position to prove the main result of this
section, which says that in a finitary pLTS the set of weak derivatives from any state
s 0 ; (3) to take the transition s 0
s 01 / 2
 
Search WWH ::




Custom Search