Information Technology Reference
In-Depth Information
Lemma 6.12 Given a finitely branching pLTS, discount ʴ and reward function r ,
there always exists a max-seeking policy.
ʴ , r
Proof
max ( s )
can be calculated for each state s . Then we can define a derivative policy dp in the
following way. For any state s ,if r ( s )
Given a pLTS, discount ʴ and reward function r , the discounted payoff
P
˄
−ₒ
ʴ , max ( ʔ 1 ) for all s
ʴ
· P
ʔ 1 , then we
˄
−ₒ
set dp undefined at s . Otherwise, we choose a transition s
ʔ among the finite
ʴ , max ( ʔ )
ʴ , max ( ʔ 1 ) for all other
number of outgoing transitions from s such that
P
≥ P
˄
−ₒ
transitions s
ʔ .
Given a pLTS, discount ʴ , reward function r , and derivative policy dp , we define
the function F ʴ , dp , r
ʔ 1 , and we set dp ( s )
=
ₒ R
ₒ R
:( S
)
( S
) by letting
r ( s )
if dp ( s )
F ʴ , dp , r ( f )( s )
=
(6.16)
ʴ
·
f ( ʔ ) f dp ( s )
=
ʔ
= s ʔ
where f ( ʔ )
ʔ ( s )
·
f ( s ).
Lemma 6.13 Given a pLTS, discount ʴ< 1 , reward function r , and derivative
policy dp , the function F ʴ , dp , r
has a unique fixed point.
We first show that the function F ʴ , dp , r
Proof
is a contraction mapping. Let f , g be
any two functions of type S
ₒ R
.
F ʴ , dp , r ( f )
F ʴ , dp , r ( g )
|
|
F ʴ , dp , r ( f )( s )
F ʴ , dp , r ( g )( s )
=
sup
{|
||
s
S
}
F ʴ , dp , r ( f )( s )
F ʴ , dp , r ( g )( s )
=
sup
{|
||
s
S and dp ( s )
↓}
= ʴ ·
sup
{| f ( ʔ )
g ( ʔ )
|| s S and dp ( s )
= ʔ for some ʔ }
f ( s )
g ( s )
s
ʴ
·
sup
{|
||
S
}
=
ʴ
·|
f
g
|
<
|
f
g
|
By the Banach unique fixed-point theorem (Theorem 2.8), the function F ʴ , dp , r has a
unique fixed point.
Lemma 6.14 Given a pLTS, discount ʴ, reward function r , and max-seeking policy
dp , the function
ʴ , max is a fixed point of F ʴ , dp , r .
P
We need to show that F ʴ , dp , r (
P
ʴ , max )( s )
= P
ʴ , max ( s ) holds for any state s .We
Proof
distinguish two cases.
1. If dp ( s )
, then F ʴ , dp , r (
P
ʴ , max )( s )
=
r ( s )
= P
ʴ , max ( s ) as expected.
2. I f dp ( s )
= ʔ , then the arguments are more involved. First we observe that if
s ʴ ʔ , then by Definition 6.14 there exist some ʔ 0
, ʔ 0 , ʔ 1 , ʔ such that
˄
−ₒ
ʔ 0 +
ʔ 0 , ʔ 0
ʔ 1 , ʔ 1 ʴ ʔ and ʔ =
ʔ 0 +
ʔ . So we can do
s
=
ʴ
·
the following calculation.
 
Search WWH ::




Custom Search