$$\bar{\mathbf{s}} := (s_2, \ldots, s_l).$$
Then the transition probabilities of M are stipulated as
$$
p_{\mathbf{s},\mathbf{s}'}(a) :=
\begin{cases}
p_{\mathbf{s}\,s'}(a), & \text{if } \mathbf{s}' = (\bar{\mathbf{s}}, s'), \\
0, & \text{otherwise},
\end{cases}
\qquad a \in A.
$$
Similarly, the rewards are taken to be
$$
r_{\mathbf{s},\mathbf{s}'}(a) :=
\begin{cases}
r_{\mathbf{s}\,s'}(a), & \text{if } \mathbf{s}' = (\bar{\mathbf{s}}, s'), \\
0, & \text{otherwise},
\end{cases}
\qquad a \in A.
$$
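To make the construction concrete, the following sketch wraps a $k$-order transition model into an ordinary MDP over history tuples. It is only an illustration under assumptions made here, not code from the text: the names `augment_k_mdp`, `p_k` and `r_k` are hypothetical, and the augmented state space is restricted to histories of length exactly $k$ for brevity.

```python
from itertools import product

def augment_k_mdp(states, p_k, r_k, k):
    """Turn a k-order model into an ordinary MDP over history tuples.

    p_k(history, a, s_next) and r_k(history, a, s_next) are assumed to return
    the k-MDP transition probability and reward for the given history
    (a tuple of the k most recent states, most recent last).
    """
    # Augmented states: all histories of length exactly k.
    aug_states = list(product(states, repeat=k))

    def p_aug(s_tuple, a, s_tuple_next):
        # Only successors obtained by shifting the window and appending
        # a new state carry positive probability.
        if s_tuple_next[:-1] != s_tuple[1:]:
            return 0.0
        return p_k(s_tuple, a, s_tuple_next[-1])

    def r_aug(s_tuple, a, s_tuple_next):
        if s_tuple_next[:-1] != s_tuple[1:]:
            return 0.0
        return r_k(s_tuple, a, s_tuple_next[-1])

    return aug_states, p_aug, r_aug
```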
Now let $(s_j)_{j\in\mathbb{N}} \in S^{\mathbb{N}}$ be a trajectory of $M$ and define

$$g : S^{\mathbb{N}} \to S^{\mathbb{N}}, \qquad (s_j)_{j\in\mathbb{N}} \mapsto \big((s_{j-k+1}, \ldots, s_j)\big)_{j\in\mathbb{N}},$$

where we made use of the convention $(s_{j-k+1}, \ldots, s_j) := (s_1, \ldots, s_j)$ for $j - k + 1 < 1$. Conversely, consider

$$h : S^{\mathbb{N}} \to S^{\mathbb{N}}, \qquad \big((s_{j-k+1}, \ldots, s_j)\big)_{j\in\mathbb{N}} \mapsto (s_j)_{j\in\mathbb{N}}.$$

Then we have $h \circ g = \mathrm{id}$, i.e., the identical mapping, and all trajectories of $S^{\mathbb{N}}$ not contained in $h(S^{\mathbb{N}})$ have vanishing probability. Furthermore, any trajectory of $M$ has the same probability and reward sequence as its image under $h$.
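The correspondence between the two trajectory spaces can be checked mechanically. The snippet below is a finite-horizon sketch under the reading adopted above ($g$ builds the truncated length-$k$ windows, $h$ returns the last component of each window); the 0-based indexing and the function names are choices made here.

```python
def g(traj, k):
    """Map a state sequence to the sequence of its (truncated) length-k windows."""
    return [tuple(traj[max(0, j - k + 1): j + 1]) for j in range(len(traj))]

def h(windowed):
    """Map a sequence of history windows to the sequence of their last states."""
    return [window[-1] for window in windowed]

traj = ["s1", "s2", "s3", "s4", "s5"]
assert h(g(traj, k=3)) == traj  # h o g recovers the original trajectory
```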
By virtue of this result, it is straightforward to verify that the state-value function $v$ of a given policy $\pi$ satisfies the Bellman equation

$$v(s_1, \ldots, s_l) = \sum_{a \in A} \pi(a \mid s_1, \ldots, s_l) \sum_{s'} p_{(s_1, \ldots, s_l)\, s'}(a) \Big( r_{(s_1, \ldots, s_l)\, s'}(a) + \gamma\, v(s_2, \ldots, s_l, s') \Big). \qquad (10.1)$$
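Because (10.1) is just the usual Bellman equation on the augmented state space, standard iterative policy evaluation can be applied verbatim. The sketch below assumes the hypothetical interfaces `p_aug` and `r_aug` from the earlier snippet, a policy `pi(a, s)` returning $\pi(a \mid s)$, and a helper `successors(s)` enumerating the shifted tuples reachable from `s`; none of these names are taken from the text.

```python
def evaluate_policy(aug_states, actions, successors, p_aug, r_aug, pi, gamma,
                    sweeps=1000, tol=1e-8):
    """Iterative policy evaluation: sweep the Bellman update (10.1) to a fixed point."""
    v = {s: 0.0 for s in aug_states}
    for _ in range(sweeps):
        biggest_change = 0.0
        for s in aug_states:
            new_v = sum(
                pi(a, s) * sum(
                    p_aug(s, a, s_next) * (r_aug(s, a, s_next) + gamma * v[s_next])
                    for s_next in successors(s)
                )
                for a in actions
            )
            biggest_change = max(biggest_change, abs(new_v - v[s]))
            v[s] = new_v
        if biggest_change < tol:
            break
    return v
```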
Also by means of state space augmentation, we may devise a $k$-MDP generalization of temporal-difference learning. Given a transition from state $s$ to $s'$ under the history $\mathbf{s} = (s_1, \ldots, s_{l-1})$, the update rule reads as
$$v := v + \alpha\, z\, \delta, \qquad (10.2)$$

where the temporal-difference error $\delta$ and the eligibility trace $z$ are given by

$$\delta := r_{s,s'} + \gamma\, v(\mathbf{s}, s;\, s') - v(\mathbf{s};\, s), \qquad z := \lambda \gamma\, z + e_{\mathbf{s},s}, \qquad (10.3)$$

with $e_{\mathbf{s},s}$ denoting the unit vector associated with the augmented state $(\mathbf{s}, s)$.
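For completeness, here is a tabular sketch of how one application of (10.2)–(10.3) might look for a single observed transition. The dictionary-based bookkeeping, the truncation of histories to the $k$ most recent states, and the name `td_lambda_update` are choices made here rather than prescriptions from the text.

```python
def td_lambda_update(v, z, history, s, s_next, reward, alpha, gamma, lam, k):
    """Apply Eqs. (10.2)-(10.3) once, for the transition s -> s_next under `history`.

    v: dict mapping augmented states (history tuples) to value estimates.
    z: dict mapping augmented states to eligibility traces.
    Both dictionaries are modified in place.
    """
    cur = (tuple(history) + (s,))[-k:]        # current augmented state
    nxt = (cur + (s_next,))[-k:]              # successor augmented state
    # Temporal-difference error, Eq. (10.3)
    delta = reward + gamma * v.get(nxt, 0.0) - v.get(cur, 0.0)
    # Trace update z := lambda*gamma*z + e_s, Eq. (10.3)
    for key in list(z):
        z[key] *= lam * gamma
    z[cur] = z.get(cur, 0.0) + 1.0
    # Value update v := v + alpha*z*delta, Eq. (10.2), applied componentwise
    for key, trace in z.items():
        v[key] = v.get(key, 0.0) + alpha * trace * delta
    return delta
```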