Fig. 3.3 Relationship between the state-value and action-value functions v and q (panels a and b)
Fig. 3.4 Bellman equation for both the action-value function q (panel a) and the state-value function v (panel b)
Since $q^{\pi}(s', a')$, too, satisfies a Bellman equation in accordance with (3.6), with new subsequent states $s''$ and actions $a''$ (which can in part contain the original $s$ and $a$!), solving (3.6) is, unlike the 1-step special case (3.5), usually a more complex undertaking. This reflects the fact that we are taking into account the entire chain of subsequent transactions, by which we address Problem 4 in Chap. 2.
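The recursive coupling described above can be made concrete with a minimal sketch. Equation (3.6) is not reproduced in this passage, so the code assumes it has the standard form $q^{\pi}(s,a) = \sum_{s'} p^{a}_{ss'} \left[ r^{a}_{ss'} + \gamma \sum_{a'} \pi(s',a')\, q^{\pi}(s',a') \right]$. The two-state, two-action model, the uniform policy, the discount `GAMMA`, and all function names are hypothetical and chosen only for illustration. Because every value of q depends on the others, the sketch repeats the backup until the values stop changing rather than computing them in one pass.

```python
# Hypothetical toy model: 2 states, 2 actions; every number is illustrative.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# P[a][s][s2]: transition probability p^a_{ss'}; each row sums to 1.
P = {
    0: [[0.8, 0.2], [0.1, 0.9]],
    1: [[0.3, 0.7], [0.6, 0.4]],
}
# R[a][s][s2]: transition reward r^a_{ss'}.
R = {
    0: [[1.0, 0.0], [0.0, 2.0]],
    1: [[0.0, 3.0], [1.0, 0.0]],
}
# PI[s][a]: policy pi(s, a); uniform in this sketch.
PI = [[0.5, 0.5], [0.5, 0.5]]


def bellman_q_backup(q):
    """Apply the assumed right-hand side of (3.6) once to every pair (s, a)."""
    new_q = {}
    for s in STATES:
        for a in ACTIONS:
            total = 0.0
            for s2 in STATES:
                # Value of continuing from s2 when actions follow the policy.
                next_value = sum(PI[s2][a2] * q[(s2, a2)] for a2 in ACTIONS)
                total += P[a][s][s2] * (R[a][s][s2] + GAMMA * next_value)
            new_q[(s, a)] = total
    return new_q


def evaluate_q(tol=1e-10, max_sweeps=10_000):
    """Repeat the backup until q no longer changes by more than tol."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(max_sweeps):
        new_q = bellman_q_backup(q)
        delta = max(abs(new_q[key] - q[key]) for key in q)
        q = new_q
        if delta < tol:
            break
    return q


q_pi = evaluate_q()
```

With `GAMMA = 0` the successor term vanishes and the loop settles after the first backup, which illustrates why the coupled case is the harder one.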
Since our transition probabilities $p^{a}_{ss'}$ in fact depend on the action $a$, we learn directly from this and thus also solve Problem 1 in Chap. 2.
For the sake of completeness, we should also mention that in (3.4), we can conversely eliminate the action-value function $q^{\pi}$. We then obtain the Bellman equation for the state-value function (Fig. 3.4b):
$$v^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} p^{a}_{ss'} \left[ r^{a}_{ss'} + \gamma\, v^{\pi}(s') \right]. \qquad (3.7)$$
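As a minimal sketch of how the Bellman equation for the state-value function can be evaluated numerically, the code below repeats the backup $v(s) \leftarrow \sum_{a} \pi(s,a) \sum_{s'} p^{a}_{ss'} [ r^{a}_{ss'} + \gamma\, v(s') ]$ until the values stop changing. This fixed-point iteration is one common choice and is not taken from the text. The two-state, two-action model, the uniform policy, the discount `GAMMA`, and the function names are hypothetical.

```python
# Hypothetical toy model: 2 states, 2 actions; every number is illustrative.
GAMMA = 0.9
N_STATES = 2
N_ACTIONS = 2

# P[a][s][s2]: transition probability p^a_{ss'}; each row sums to 1.
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.3, 0.7], [0.6, 0.4]],
]
# R[a][s][s2]: transition reward r^a_{ss'}.
R = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
# PI[s][a]: policy pi(s, a); uniform in this sketch.
PI = [[0.5, 0.5], [0.5, 0.5]]


def bellman_v_backup(v):
    """Apply the right-hand side of the state-value Bellman equation once."""
    new_v = []
    for s in range(N_STATES):
        total = 0.0
        for a in range(N_ACTIONS):
            # Expected reward plus discounted successor value for action a.
            expected = sum(
                P[a][s][s2] * (R[a][s][s2] + GAMMA * v[s2])
                for s2 in range(N_STATES)
            )
            total += PI[s][a] * expected
        new_v.append(total)
    return new_v


def evaluate_v(tol=1e-10, max_sweeps=10_000):
    """Repeat the backup until v no longer changes by more than tol."""
    v = [0.0] * N_STATES
    for _ in range(max_sweeps):
        new_v = bellman_v_backup(v)
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return v


v_pi = evaluate_v()
```

Because the equation is linear in the unknown values, a direct linear solve would work equally well for small state spaces; the iteration is used here only because it mirrors the recursive structure of the equation.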