Algorithm 3.4: Sarsa(λ)
Input: online rewards r and transitions s′, step-size α, discount rate γ, eligibility trace parameter λ
Output: optimal action-value function q*
1:  initialize arbitrarily q(s, a), z(s, a) := 0 ∀ s ∈ S, ∀ a ∈ A(s)
2:  repeat (for each episode)
3:    initialize s, a
4:    repeat (for each step of episode)
5:      take action a, observe r, s′
6:      choose a′ from s′ using policy derived from q (e.g. ε-greedy)
7:      δ := r + γ q(s′, a′) − q(s, a)
8:      z(s, a) := z(s, a) + 1
9:      for all s, a do
10:       q(s, a) := q(s, a) + α δ z(s, a)
11:       z(s, a) := γ λ z(s, a)
12:     end for
13:     s := s′, a := a′
14:   until s is terminal
15: until stop
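To make the procedure concrete, the following is a minimal sketch of tabular Sarsa(λ) in Python. The environment interface (reset/step), the toy chain MDP, and all parameter values are assumptions introduced here for illustration; they are not part of the algorithm above.

import random
from collections import defaultdict

class ChainEnv:
    # Toy 5-state chain: walk "left"/"right"; reward 1 on reaching the end.
    actions = ["left", "right"]

    def __init__(self, n=5):
        self.n = n
        self.s = 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(self.s + 1, self.n - 1) if a == "right" else max(self.s - 1, 0)
        done = self.s == self.n - 1
        return (1.0 if done else 0.0), self.s, done

def epsilon_greedy(q, s, actions, epsilon):
    # Policy derived from q (line 6 of the algorithm).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(s, a)])

def sarsa_lambda(env, episodes=200, alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1):
    q = defaultdict(float)                  # q(s, a) := 0 for all s, a
    for _ in range(episodes):               # repeat (for each episode)
        z = defaultdict(float)              # eligibility traces z(s, a)
        s = env.reset()
        a = epsilon_greedy(q, s, env.actions, epsilon)
        done = False
        while not done:                     # repeat (for each step of episode)
            r, s_next, done = env.step(a)   # take action a, observe r, s'
            a_next = epsilon_greedy(q, s_next, env.actions, epsilon)
            # delta := r + gamma*q(s', a') - q(s, a); q(terminal, .) is 0
            delta = r + (0.0 if done else gamma * q[(s_next, a_next)]) - q[(s, a)]
            z[(s, a)] += 1.0                # accumulating trace for (s, a)
            for key in list(z):             # for all s, a with nonzero trace
                q[key] += alpha * delta * z[key]
                z[key] *= gamma * lam       # decay all traces
            s, a = s_next, a_next
    return q

q = sarsa_lambda(ChainEnv())
print(sorted(q.items()))

Storing the traces in a dictionary means the inner loop only touches state-action pairs that have actually been visited, which is how the "for all s, a" sweep is usually realized in practice.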
3.10 Summary
In this chapter, we gave a short introduction to reinforcement learning. We have seen that RL addresses the problems from Chap. 2 and is thus, in principle, a suitable tool for solving all four of them.
Furthermore, in addition to online learning, we suggested offline learning by policy iteration for solving the Bellman equation. Both approaches are linked consistently via the action-value and state-value functions. We can, for instance, calculate the action-value function offline, using historical data, and then update it online. In this way, RL is also a very nice example of the link between offline and online learning.
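The following sketch illustrates this offline/online link. The logged transition format (s, a, r, s′, a′) and the function names are assumptions made here for illustration only, not the book's notation.

from collections import defaultdict

def offline_action_values(log, alpha=0.1, gamma=0.95, sweeps=20):
    # Estimate q(s, a) from historical transitions by repeated
    # Sarsa-style backups over the log (an offline evaluation pass).
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, a_next in log:
            q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] - q[(s, a)])
    return q

def online_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    # Continue refining the same table with live transitions.
    q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] - q[(s, a)])

# Example: warm-start from two logged transitions, then update online.
history = [(0, "right", 0.0, 1, "right"), (1, "right", 1.0, 2, "right")]
q = offline_action_values(history)
online_update(q, 0, "right", 0.0, 1, "right")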
We now turn to the application of reinforcement learning to recommendation engines.