Definition 6.18 (Policy-Following Payoff) Given a discount $\delta$, reward function $r$, and derivative policy $dp$, the policy-following payoff function $\mathbb{P}^{\delta,dp,r} : S \to \mathbb{R}$ is defined by $\mathbb{P}^{\delta,dp,r}(s) = r \cdot \Delta$, where $\Delta$ is determined by the discounted weak derivation $s \Longrightarrow_{\delta,dp} \Delta$.
Lemma 6.15 For any discount $\delta$, reward function $r$, and derivative policy $dp$, the function $\mathbb{P}^{\delta,dp,r}$ is a fixed point of $F^{\delta,dp,r}$.
Proof We need to show that $F^{\delta,dp,r}(\mathbb{P}^{\delta,dp,r})(s) = \mathbb{P}^{\delta,dp,r}(s)$ holds for any state $s$. There are two cases.
1. If $dp(s)\uparrow$, then $s \Longrightarrow_{\delta,dp} \Delta$ implies $\Delta = \overline{s}$. Therefore, $\mathbb{P}^{\delta,dp,r}(s) = r(s) = F^{\delta,dp,r}(\mathbb{P}^{\delta,dp,r})(s)$, as required.
2. Suppose $dp(s) = \Delta_1$. If $s \Longrightarrow_{\delta,dp} \Delta$ then $s \xrightarrow{\tau} \Delta_1$, $\Delta_1 \Longrightarrow_{\delta,dp} \Delta'$ and $\Delta = \delta\Delta'$ for some subdistribution $\Delta'$. Therefore,
$$\mathbb{P}^{\delta,dp,r}(s) = r \cdot \Delta = r \cdot \delta\Delta' = \delta \cdot (r \cdot \Delta') = \delta \cdot \mathbb{P}^{\delta,dp,r}(\Delta_1) = F^{\delta,dp,r}(\mathbb{P}^{\delta,dp,r})(s).$$
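A small numerical check of the lemma, reusing the `payoff` sketch above. The operator `F` below is the one the two proof cases suggest, an assumption on our part since Lemma 6.13's definition lies outside this excerpt: $F(f)(s) = r(s)$ when $dp(s)\uparrow$, and $\delta$ times the expected value of $f$ under $dp(s)$ otherwise. The example model and its numbers are made up for illustration.

```python
# Tiny 3-state example: s0 -tau-> {s1, s2}, s1 -tau-> s2, dp(s2) undefined.
# We verify F(P) = P pointwise, i.e. the fixed-point claim of Lemma 6.15.

states = ["s0", "s1", "s2"]
dp_table = {"s0": {"s1": 0.5, "s2": 0.5}, "s1": {"s2": 1.0}, "s2": None}
r_table = {"s0": 0.0, "s1": 0.0, "s2": 1.0}
delta = 0.9

dp = dp_table.get   # returns None where the policy is undefined
r = r_table.get

def F(f):
    """Assumed form of F^{delta,dp,r}, read off from the proof's two cases."""
    def g(s):
        if dp(s) is None:
            return r(s)                                   # case 1
        return delta * sum(p * f(t) for t, p in dp(s).items())  # case 2
    return g

P = {s: payoff(s, dp, r, delta) for s in states}
FP = {s: F(P.get)(s) for s in states}
assert all(abs(P[s] - FP[s]) < 1e-9 for s in states)  # F(P) = P pointwise
```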
Proposition 6.6 Let $\delta \in [0, 1)$ be a discount and $r$ a reward function. If $dp$ is a max-seeking policy with respect to $\delta$ and $r$, then $\mathbb{P}^{\delta,\max} = \mathbb{P}^{\delta,dp,r}$.
Proof By Lemma 6.13, the function $F^{\delta,dp,r}$ has a unique fixed point. By Lemmas 6.14 and 6.15, both $\mathbb{P}^{\delta,\max}$ and $\mathbb{P}^{\delta,dp,r}$ are fixed points of the same function $F^{\delta,dp,r}$, which means that $\mathbb{P}^{\delta,\max}$ and $\mathbb{P}^{\delta,dp,r}$ coincide.
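The unique-fixed-point claim of Lemma 6.13 (outside this excerpt) presumably rests on $F^{\delta,dp,r}$ being a $\delta$-contraction for $\delta \in [0,1)$; if so, iterating $F$ from any starting function converges to the single fixed point that the proposition equates with both payoffs. A hypothetical check on the example above:

```python
# Value-iteration style loop: iterate the (assumed) contraction F from the
# all-zero function; the iterates converge to the unique fixed point P.

f = {s: 0.0 for s in states}            # arbitrary starting function
for _ in range(200):
    f = {s: F(f.get)(s) for s in states}
assert all(abs(f[s] - P[s]) < 1e-6 for s in states)
```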
Lemma 6.16 Suppose $s \Longrightarrow \Delta$ with $\Delta = \sum_{i=0}^{\infty} \Delta_i$ for some properly related $\Delta_i$. Let $\{\delta_j\}_{j=0}^{\infty}$ be a nondecreasing sequence of discount factors converging to $1$. Then for any reward function $r$ it holds that
$$r \cdot \Delta = \lim_{j\to\infty} \sum_{i=0}^{\infty} \delta_j^i \, (r \cdot \Delta_i).$$
Proof Let $f : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ be the function defined by $f(i,j) = \delta_j^i \, (r \cdot \Delta_i)$. We check that $f$ satisfies the four conditions in Proposition 4.3.
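The lemma is a limit-exchange statement: the discounted sums approach the undiscounted payoff as $\delta_j \to 1$ from below. A quick illustration with made-up data, where each $r \cdot \Delta_i = 2^{-(i+1)}$ so that $r \cdot \Delta = 1$; the series is truncated to 200 terms, which is harmless at this precision.

```python
# Illustration of Lemma 6.16 with fabricated rewards r . Delta_i = 2^-(i+1):
# the discounted sums tend to the undiscounted total as delta_j -> 1.

rd = [2.0 ** -(i + 1) for i in range(200)]   # r . Delta_i, truncated series
total = sum(rd)                               # r . Delta  (approximately 1)

for j in range(1, 6):
    delta_j = 1 - 10.0 ** -j                  # nondecreasing, converges to 1
    discounted = sum(delta_j ** i * x for i, x in enumerate(rd))
    print(delta_j, discounted)                # approaches `total`
```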