Information Technology Reference
In-Depth Information
Lemma 6.12
Given a finitely branching pLTS, discount ʴ and reward function
r
,
there always exists a max-seeking policy.
ʴ
,
r
Proof
max
(
s
)
can be calculated for each state
s
. Then we can define a derivative policy
dp
in the
following way. For any state
s
,if
r
(
s
)
Given a pLTS, discount
ʴ
and reward function
r
, the discounted payoff
P
˄
−ₒ
ʴ
,
max
(
ʔ
1
) for all
s
≥
ʴ
· P
ʔ
1
, then we
˄
−ₒ
set
dp
undefined at
s
. Otherwise, we choose a transition
s
ʔ
among the finite
ʴ
,
max
(
ʔ
)
ʴ
,
max
(
ʔ
1
) for all other
number of outgoing transitions from
s
such that
P
≥ P
˄
−ₒ
transitions
s
ʔ
.
Given a pLTS, discount
ʴ
, reward function
r
, and derivative policy
dp
, we define
the function
F
ʴ
,
dp
,
r
ʔ
1
, and we set
dp
(
s
)
=
ₒ R
ₒ
ₒ R
:(
S
)
(
S
) by letting
⊧
⊨
r
(
s
)
if
dp
(
s
)
↑
F
ʴ
,
dp
,
r
(
f
)(
s
)
=
(6.16)
⊩
ʴ
·
f
(
ʔ
) f
dp
(
s
)
=
ʔ
=
s
∈
ʔ
where
f
(
ʔ
)
ʔ
(
s
)
·
f
(
s
).
Lemma 6.13
Given a pLTS, discount ʴ<
1
, reward function
r
, and derivative
policy
dp
, the function F
ʴ
,
dp
,
r
has a unique fixed point.
We first show that the function
F
ʴ
,
dp
,
r
Proof
is a contraction mapping. Let
f
,
g
be
any two functions of type
S
ₒ R
.
F
ʴ
,
dp
,
r
(
f
)
F
ʴ
,
dp
,
r
(
g
)
|
−
|
F
ʴ
,
dp
,
r
(
f
)(
s
)
F
ʴ
,
dp
,
r
(
g
)(
s
)
=
sup
{|
−
||
s
∈
S
}
F
ʴ
,
dp
,
r
(
f
)(
s
)
F
ʴ
,
dp
,
r
(
g
)(
s
)
=
sup
{|
−
||
s
∈
S
and
dp
(
s
)
↓}
=
ʴ
·
sup
{|
f
(
ʔ
)
−
g
(
ʔ
)
||
s
∈
S
and
dp
(
s
)
=
ʔ
for some
ʔ
}
f
(
s
)
g
(
s
)
s
∈
≤
ʴ
·
sup
{|
−
||
S
}
=
ʴ
·|
f
−
g
|
<
|
f
−
g
|
By the Banach unique fixed-point theorem (Theorem 2.8), the function
F
ʴ
,
dp
,
r
has a
unique fixed point.
Lemma 6.14
Given a pLTS, discount ʴ, reward function
r
, and max-seeking policy
dp
, the function
ʴ
,
max
is a fixed point of F
ʴ
,
dp
,
r
.
P
We need to show that
F
ʴ
,
dp
,
r
(
P
ʴ
,
max
)(
s
)
= P
ʴ
,
max
(
s
) holds for any state
s
.We
Proof
distinguish two cases.
1. If
dp
(
s
)
, then
F
ʴ
,
dp
,
r
(
↑
P
ʴ
,
max
)(
s
)
=
r
(
s
)
= P
ʴ
,
max
(
s
) as expected.
2.
I
f
dp
(
s
)
=
ʔ
, then the arguments are more involved. First we observe that if
s
⃒
ʴ
ʔ
, then by Definition
6.14
there exist some
ʔ
0
,
ʔ
0
,
ʔ
1
,
ʔ
such that
˄
−ₒ
ʔ
0
+
ʔ
0
,
ʔ
0
ʔ
1
,
ʔ
1
⃒
ʴ
ʔ
and
ʔ
=
ʔ
0
+
ʔ
. So we can do
s
=
ʴ
·
the following calculation.
Search WWH ::
Custom Search