$$\bar{s}s' := (s_2, \ldots, s_l, s').$$

Then the transition probabilities of $M$ are stipulated as

$$p_{\bar{s},\bar{s}'}(a) :=
\begin{cases}
p_{ss'}(a;\bar{s}) & \text{if } \bar{s}' = \bar{s}s',\\
0 & \text{otherwise},
\end{cases}
\qquad a \in A.$$
Similarly, the rewards are taken to be

$$r_{\bar{s},\bar{s}'}(a) :=
\begin{cases}
r_{ss'}(a;\bar{s}) & \text{if } \bar{s}' = \bar{s}s',\\
0 & \text{otherwise},
\end{cases}
\qquad a \in A.$$
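To make the construction concrete, the following sketch (not from the source; the function names `augment`, `p_k`, and `r_k` are hypothetical) tabulates the transition probabilities and rewards of $M$ from a given $k$-MDP kernel, assigning probability zero to every augmented successor other than the shifted history.

```python
from itertools import product

def augment(states, actions, p_k, r_k, k):
    """Tabulate the ordinary MDP M whose states are k-tuples of k-MDP states.

    p_k(s_next, a, hist) and r_k(s_next, a, hist) give the k-MDP's transition
    probability and reward for moving to s_next under action a, given the
    history tuple hist whose last entry is the current state.
    """
    aug_states = list(product(states, repeat=k))
    P = {}   # (hist, a, hist') -> probability; omitted triples have probability 0
    R = {}   # (hist, a, hist') -> reward
    for hist in aug_states:
        for a in actions:
            for s_next in states:
                shifted = hist[1:] + (s_next,)   # the shifted history, cf. the case above
                P[(hist, a, shifted)] = p_k(s_next, a, hist)
                R[(hist, a, shifted)] = r_k(s_next, a, hist)
    return aug_states, P, R
```

Enumerating all $k$-tuples is exponential in $k$, which is the usual price of this kind of state-space augmentation.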
Now let $(s_j)_{j\in\mathbb{N}} \in S^{\mathbb{N}}$ be a trajectory of the underlying $k$-MDP and define

$$g\colon S^{\mathbb{N}} \to \bar{S}^{\mathbb{N}}, \qquad (s_j)_{j\in\mathbb{N}} \mapsto \bigl((s_{j-k+1}, \ldots, s_j)\bigr)_{j\in\mathbb{N}},$$

where $\bar{S}$ denotes the state space of $M$ and we made use of the convention $(s_{j-k+1}, \ldots, s_j) := (s_1, \ldots, s_j)$ for $j-k+1 < 1$.
Conversely, consider

$$h\colon \bar{S}^{\mathbb{N}} \to S^{\mathbb{N}}, \qquad (\bar{s}_j)_{j\in\mathbb{N}} \mapsto (s_j)_{j\in\mathbb{N}},$$

which maps each history tuple $\bar{s}_j$ to its last component $s_j$. Then we have $h \circ g = \mathrm{id}$, i.e., the identical mapping, and all trajectories of $M$ not contained in $g(S^{\mathbb{N}})$ have vanishing probability. Furthermore, any trajectory of $M$ has the same probability and reward sequence as its image under $h$.
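The maps $g$ and $h$ translate directly into code. Below is a minimal sketch on finite trajectories (the names `lift` and `project` are hypothetical) illustrating the padding convention and the identity $h \circ g = \mathrm{id}$.

```python
def lift(traj, k):
    """g: window a finite state trajectory into history tuples.

    The tuple for position j is (s_{j-k+1}, ..., s_j); when j-k+1 < 1 it is
    truncated to (s_1, ..., s_j), mirroring the convention in the text.
    """
    return [tuple(traj[max(0, j - k + 1): j + 1]) for j in range(len(traj))]

def project(aug_traj):
    """h: recover the underlying trajectory from the history tuples."""
    return [hist[-1] for hist in aug_traj]

# h after g is the identity on trajectories:
traj = ["a", "b", "c", "d"]
assert project(lift(traj, k=2)) == traj
print(lift(traj, k=2))  # [('a',), ('a', 'b'), ('b', 'c'), ('c', 'd')]
```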
By virtue of this result, it is straightforward to verify that the state-value function $v$ of a given policy $\pi$ satisfies the Bellman equation
$$v(s_1, \ldots, s_l) = \sum_{a\in A} \pi(a \mid s_1, \ldots, s_l) \sum_{s'\in S} p_{(s_1,\ldots,s_l),\,s'}(a)\,\bigl[r_{(s_1,\ldots,s_l),\,s'}(a) + \gamma\, v(s_2, \ldots, s_l, s')\bigr]. \tag{10.1}$$
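As one way to use (10.1), the following sketch (hypothetical names; tabular kernels `p_k`, `r_k` and a policy `pi` are assumed inputs, e.g., as produced by a construction like `augment` above) performs iterative policy evaluation by repeatedly applying the right-hand side of (10.1) on the augmented states until the values stabilize.

```python
def evaluate_policy(aug_states, states, actions, pi, p_k, r_k, gamma, tol=1e-8):
    """Fixed-point iteration for the Bellman equation (10.1) over history tuples.

    pi(a, hist) is the policy probability of action a given history hist;
    p_k and r_k are the k-MDP transition probabilities and rewards as above.
    Assumes aug_states contains every k-tuple, so shifted histories are keys.
    """
    v = {hist: 0.0 for hist in aug_states}
    while True:
        delta = 0.0
        for hist in aug_states:
            new = sum(
                pi(a, hist) * p_k(s_next, a, hist)
                * (r_k(s_next, a, hist) + gamma * v[hist[1:] + (s_next,)])
                for a in actions
                for s_next in states
            )
            delta = max(delta, abs(new - v[hist]))
            v[hist] = new
        if delta < tol:
            return v
```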
Also by means of state space augmentation, we may devise a $k$-MDP generalization of temporal-difference learning. Given a transition from state $s$ to $s'$ under the history $\bar{s} = (s_1, \ldots, s_{l-1})$, the update rule reads as

$$v := v + \alpha\, z\, d, \tag{10.2}$$
where

$$d := r_{ss'}(a; \bar{s}) + \gamma\, v(s'; s_2, \ldots, s_{l-1}, s) - v(s; \bar{s}), \qquad z := \lambda\gamma\, z + e_{(s;\bar{s})}, \tag{10.3}$$

and $e_{(s;\bar{s})}$ denotes the unit vector associated with the augmented state formed by the history $\bar{s}$ followed by $s$.
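A tabular sketch of the update (10.2) and (10.3) could look as follows (hypothetical names; the observed reward is used in place of $r_{ss'}(a;\bar{s})$, and `alpha`, `gamma`, `lam` denote $\alpha$, $\gamma$, $\lambda$). Both the value table and the eligibility trace are indexed by augmented states, i.e., by the history tuple extended with the current state.

```python
from collections import defaultdict

def td_lambda_step(v, z, hist, s, s_next, reward, alpha, gamma, lam):
    """One application of (10.2)-(10.3) after observing s -> s_next under history hist.

    v and z map augmented states (history + current state) to values and
    eligibility traces, respectively.
    """
    cur = hist + (s,)                 # augmented state (s_1, ..., s_{l-1}, s)
    nxt = hist[1:] + (s, s_next)      # shifted history followed by s_next
    d = reward + gamma * v[nxt] - v[cur]       # TD error, cf. (10.3)
    for key in z:                              # decay all existing traces
        z[key] *= lam * gamma
    z[cur] += 1.0                              # z := lam*gamma*z + e_(s; hist)
    for key, trace in z.items():               # v := v + alpha*z*d, cf. (10.2)
        v[key] += alpha * trace * d
    return d

# minimal usage with defaultdicts so unseen augmented states start at zero
v, z = defaultdict(float), defaultdict(float)
td_lambda_step(v, z, hist=("a",), s="b", s_next="c",
               reward=1.0, alpha=0.1, gamma=0.9, lam=0.8)
```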