Fig. 3.7 A sequence of an episode of 2 steps (states s_t, s_{t+1}, s_{t+2}; actions a_t, a_{t+1})
So while t indicates the step within the episode, that is, along the chain (s_1, a_1) → (s_2, a_2) → (s_3, a_3) → ..., k is the index of the update for a fixed pair (s_t, a_t) throughout all episodes. In order not to overburden the notation, we leave out the index k and in its place use the assignment symbol ":=".
Before we come to the explanation, the first question immediately arises: since, to carry out an update of the action value q(s_t, a_t) at step t in real time, we need the action value q(s_{t+1}, a_{t+1}) of the next step t+1, how is this supposed to work in practice? Doesn't this remind you of Baron Münchhausen, who escapes from the swamp by pulling himself up by the hair?
There are simple solutions to this: we can, for instance, wait until step t+1 and then perform the update (3.10), that is, always learn with a delay of one step. Or we may exploit the fact that we determine the actions ourselves via the policy (provided we are not learning from historical data): at step t, we already know our next action a_{t+1} and can thus work with the current q(s_{t+1}, a_{t+1}) (Fig. 3.7).
To continue with the explanation, α_t is the learning parameter at step t. The higher it is, the faster the algorithm learns. Thus, the current temporal difference d_t is
d_t(s_t, a_t, s_{t+1}, a_{t+1}) = r_{t+1} + γ q(s_{t+1}, a_{t+1}) − q(s_t, a_t)    (3.11)
and (3.10) takes the following form:

q(s_t, a_t) := q(s_t, a_t) + α_t d_t(s_t, a_t, s_{t+1}, a_{t+1}).    (3.12)
This means that we compute the new estimate

q̃(s_t, a_t) := r_{t+1} + γ q(s_{t+1}, a_{t+1})
and subtract the previous iterate q(s_t, a_t) therefrom. If q̃(s_t, a_t) is greater than q(s_t, a_t), then the latter is increased in accordance with (3.12); if q̃(s_t, a_t) is less than q(s_t, a_t), then the latter is decreased in accordance with (3.12).
So what does q̃(s_t, a_t) mean? We know that q(s_t, a_t) is the expected return taken across the remainder of the episode. The first term r_{t+1} is the direct reward of the recommendation a_t. The second term γ q(s_{t+1}, a_{t+1}) is the expected return from the new state s_{t+1}.
It follows that there are once again two possibilities for the reason why q̃(s_t, a_t) may be higher than q(s_t, a_t): either the direct reward r_{t+1} is high or the action a_t has led to a valuable state s_{t+1} with a high action value q(s_{t+1}, a_{t+1}) (or both).