Definition 6.18 (Policy-Following Payoff) Given a discount $\delta$, reward function $r$, and derivative policy $dp$, the policy-following payoff function $P^{\delta,dp,r} : S \to \mathbb{R}$ is defined by

$$P^{\delta,dp,r}(s) = r \cdot \Delta$$

where $\Delta$ is determined by the discounted weak derivation $s \Longrightarrow_{\delta,dp} \Delta$.
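To make the definition concrete, here is a minimal numeric sketch in Python, assuming a finite state space. It computes $P^{\delta,dp,r}$ by iterating the functional $F^{\delta,dp,r}$ that Lemma 6.15 below characterizes as having this payoff function as a fixed point. The encoding (dictionaries, `.get` as a partial function, `None` for "no derivative") and all concrete values are illustrative, not from the text.

```python
def policy_following_payoff(states, dp, r, delta, iters=1000):
    """Approximate P^{delta,dp,r} by iterating F^{delta,dp,r} from the
    zero function; with delta < 1 the iterates converge geometrically."""
    P = {s: 0.0 for s in states}
    for _ in range(iters):
        P = {s: r(s) if dp(s) is None  # dp(s) undefined: stop, collect r(s)
             else delta * sum(p * P[t] for t, p in dp(s).items())  # dp(s) = Delta_1
             for s in states}
    return P

# Hypothetical example: s0 moves to s1 or s2 with equal probability,
# s1 has no derivative (so reward 1 is collected there), s2 moves to s1.
states = ["s0", "s1", "s2"]
dp = {"s0": {"s1": 0.5, "s2": 0.5}, "s1": None, "s2": {"s1": 1.0}}.get
r = {"s0": 0.0, "s1": 1.0, "s2": 0.0}.get
delta = 0.9

print(policy_following_payoff(states, dp, r, delta))
# approx {'s0': 0.855, 's1': 1.0, 's2': 0.9}, since P(s0) = 0.9*(0.5*1 + 0.5*0.9)
```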
Lemma 6.15 For any discount $\delta$, reward function $r$, and derivative policy $dp$, the function $P^{\delta,dp,r}$ is a fixed point of $F^{\delta,dp,r}$.

Proof We need to show that $F^{\delta,dp,r}(P^{\delta,dp,r})(s) = P^{\delta,dp,r}(s)$ holds for any state $s$. There are two cases.
1. If $dp(s)$ is undefined, then $s \Longrightarrow_{\delta,dp} \Delta$ implies $\Delta = \overline{s}$, the point distribution on $s$. Therefore,
$$F^{\delta,dp,r}(P^{\delta,dp,r})(s) = r(s) = P^{\delta,dp,r}(s),$$
as required.
2. Suppose $dp(s) = \Delta_1$. If $s \Longrightarrow_{\delta,dp} \Delta$ then $s \xrightarrow{\tau} \Delta_1$, $\Delta_1 \Longrightarrow_{\delta,dp} \Delta'$ and $\Delta = \delta\Delta'$ for some subdistribution $\Delta'$. Therefore,
$$P^{\delta,dp,r}(s) = r \cdot \Delta = r \cdot (\delta\Delta') = \delta\,(r \cdot \Delta') = \delta \cdot P^{\delta,dp,r}(\Delta_1) = F^{\delta,dp,r}(P^{\delta,dp,r})(s).$$
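As a quick sanity check on Lemma 6.15, and continuing the illustrative encoding from the sketch after Definition 6.18, applying the functional once more to the computed payoff should leave it unchanged up to floating-point error:

```python
P = policy_following_payoff(states, dp, r, delta)
F_P = {s: r(s) if dp(s) is None
       else delta * sum(p * P[t] for t, p in dp(s).items())
       for s in states}
assert all(abs(P[s] - F_P[s]) < 1e-9 for s in states)  # P is a fixed point of F
```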
Proposition 6.6 Let $\delta \in [0, 1)$ be a discount and $r$ a reward function. If $dp$ is a max-seeking policy with respect to $\delta$ and $r$, then $P^{\delta,\max} = P^{\delta,dp,r}$.

Proof By Lemma 6.13, the function $F^{\delta,dp,r}$ has a unique fixed point. By Lemmas 6.14 and 6.15, both $P^{\delta,\max}$ and $P^{\delta,dp,r}$ are fixed points of the same function $F^{\delta,dp,r}$, so they coincide.
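The uniqueness the proof leans on comes from $\delta < 1$, which (per Lemma 6.13 in the text) makes the functional have a single fixed point; intuitively, iterating it from any two starting functions drives them together. A sketch of that behavior in the same illustrative Python encoding as above:

```python
def apply_F(P):
    """One application of F^{delta,dp,r} in the illustrative encoding."""
    return {s: r(s) if dp(s) is None
            else delta * sum(p * P[t] for t, p in dp(s).items())
            for s in states}

lo = {s: -100.0 for s in states}  # two deliberately different starting functions
hi = {s: +100.0 for s in states}
for _ in range(500):
    lo, hi = apply_F(lo), apply_F(hi)
assert all(abs(lo[s] - hi[s]) < 1e-9 for s in states)  # both meet at the unique fixed point
```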
Lemma 6.16 Suppose $s \Longrightarrow \Delta$ with $\Delta = \sum_{i=0}^{\infty} \Delta_i$ for some properly related $\Delta_i$. Let $\{\delta_j\}_{j=0}^{\infty}$ be a nondecreasing sequence of discount factors converging to $1$. Then for any reward function $r$ it holds that

$$r \cdot \Delta = \lim_{j \to \infty} \sum_{i=0}^{\infty} \delta_j^i\,(r \cdot \Delta_i).$$

Proof Let $f : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ be the function defined by $f(i, j) = \delta_j^i\,(r \cdot \Delta_i)$. We check that $f$ satisfies the four conditions in Proposition 4.3.
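To see the limit in Lemma 6.16 numerically, take hypothetical data in which $r \cdot \Delta_i = (1/2)^{i+1}$, so that $r \cdot \Delta = \sum_i r \cdot \Delta_i = 1$. The discounted sums then equal $1/(2 - \delta_j)$ and climb toward $1$ as the $\delta_j$ approach $1$:

```python
# Hypothetical data for Lemma 6.16: r . Delta_i = (1/2)**(i+1), truncated
# at 200 terms, far past double precision for these values.
discounted_payoff = lambda d: sum(d**i * 0.5**(i + 1) for i in range(200))

for delta_j in [0.5, 0.9, 0.99, 0.999]:
    print(delta_j, discounted_payoff(delta_j))
# 0.666..., 0.909..., 0.990..., 0.999... -> tends to r . Delta = 1
```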