Algorithm 3.4: Sarsa(λ)
Input: online rewards r and transitions s′, step-size α, discount rate γ, eligibility trace parameter λ
Output: optimal action-value function q*
1: initialize q(s, a) arbitrarily, z(s, a) := 0 ∀ s ∈ S, ∀ a ∈ A(s)
2: repeat for each episode
3:    initialize s, a
4:    repeat for each step of episode
5:       take action a, observe r, s′
6:       choose a′ from s′ using policy derived from q (e.g. ε-greedy)
7:       d := r + γ q(s′, a′) - q(s, a)
8:       z(s, a) := z(s, a) + 1
9:       for all s, a do
10:         q(s, a) := q(s, a) + α d z(s, a)
11:         z(s, a) := γλ z(s, a)
12:      end for
13:   until s is terminal
14: until stop
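As a concrete illustration, the following Python sketch implements tabular Sarsa(λ) with accumulating eligibility traces along the lines of Algorithm 3.4. The environment interface (reset/step), the ε-greedy helper, and all names are assumptions for illustration only; the sketch also advances the current state-action pair to (s′, a′) at the end of each step, which the listing above leaves implicit.

import random
from collections import defaultdict

def sarsa_lambda(env, actions, alpha=0.1, gamma=0.9, lam=0.8,
                 epsilon=0.1, episodes=1000):
    """Tabular Sarsa(lambda) with accumulating eligibility traces (sketch)."""
    q = defaultdict(float)               # action-value estimates q(s, a)

    def epsilon_greedy(s):
        # behaviour policy derived from q (step 6 of the algorithm)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(s, a)])

    for _ in range(episodes):
        z = defaultdict(float)           # eligibility traces, reset per episode
        s = env.reset()                  # initialize s (assumed env interface)
        a = epsilon_greedy(s)            # initialize a
        done = False
        while not done:
            s_next, r, done = env.step(a)        # take action a, observe r, s'
            a_next = epsilon_greedy(s_next)      # choose a' from s'
            d = r + gamma * q[(s_next, a_next)] - q[(s, a)]   # temporal difference
            z[(s, a)] += 1.0                     # accumulate trace for (s, a)
            for key in list(z.keys()):           # "for all s, a" with nonzero trace
                q[key] += alpha * d * z[key]
                z[key] *= gamma * lam            # decay traces
            s, a = s_next, a_next                # advance to (s', a')
    return q

Since z only contains state-action pairs visited in the current episode, the "for all s, a" loop in practice needs to touch only those entries.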
3.10 Summary
In this chapter, we gave a short introduction to reinforcement learning. We have
seen that RL addresses the problems from Chap. 2 and is therefore, in principle,
a suitable tool for solving all four of them.
Furthermore, in addition to online learning, we had previously suggested offline
learning by policy iteration for solving the Bellman equation. The two approaches
are consistently linked via the action-value and state-value functions: we can,
for instance, calculate the action-value function offline from historical data and
then update it online. In this way, RL is also a very nice example of the link
between the two types of learning, in accordance with Rule III in Chap. 1.
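A minimal sketch of this offline/online link, with hypothetical data formats and function names: the action-value function is first fitted to a batch of logged transitions and then refined with the same temporal-difference update as new transitions arrive.

from collections import defaultdict

# Hypothetical sketch: q is first estimated offline from logged transitions
# (s, a, r, s', a') and then refined online with the same Sarsa-style update.
def fit_offline_then_update_online(history, live_stream, alpha=0.1, gamma=0.9):
    q = defaultdict(float)

    def td_update(s, a, r, s_next, a_next):
        # Sarsa update: move q(s, a) toward r + gamma * q(s', a').
        q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] - q[(s, a)])

    # Offline phase: one or more sweeps over historical data.
    for transition in history:
        td_update(*transition)

    # Online phase: the same update applied as new transitions arrive.
    for transition in live_stream:
        td_update(*transition)

    return q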
We now return to reinforcement learning and consider its application to recommendation engines.