Algorithm 3.4: Sarsa(λ)
Input: online rewards r and transitions s′, step-size α, discount rate γ, eligibility trace parameter λ
Output: optimal action-value function q*
1:  initialize arbitrarily q(s, a), z(s, a) := 0 ∀ s ∈ S, ∀ a ∈ A(s)
2:  repeat (for each episode)
3:    initialize s, a
4:    repeat (for each step of episode)
5:      take action a, observe r, s′
6:      choose a′ from s′ using policy derived from q (e.g. ε-greedy)
7:      δ := r + γ q(s′, a′) − q(s, a)
8:      z(s, a) := z(s, a) + 1
9:      for all s, a do
10:       q(s, a) := q(s, a) + α δ z(s, a)
11:       z(s, a) := γ λ z(s, a)
12:     end for
13:     s := s′, a := a′
14:   until s is terminal
15: until stop
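To make the procedure concrete, the following is a minimal sketch of tabular Sarsa(λ) in Python. The environment interface (reset/step), the toy chain MDP, and all parameter values are assumptions introduced here for illustration; they are not part of the algorithm above.

import random
from collections import defaultdict

class ChainEnv:
    # Toy 5-state chain: walk "left"/"right"; reward 1 on reaching the end.
    actions = ["left", "right"]

    def __init__(self, n=5):
        self.n = n
        self.s = 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(self.s + 1, self.n - 1) if a == "right" else max(self.s - 1, 0)
        done = self.s == self.n - 1
        return (1.0 if done else 0.0), self.s, done

def epsilon_greedy(q, s, actions, epsilon):
    # Policy derived from q (line 6 of the algorithm).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(s, a)])

def sarsa_lambda(env, episodes=200, alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1):
    q = defaultdict(float)                  # q(s, a) := 0 for all s, a
    for _ in range(episodes):               # repeat (for each episode)
        z = defaultdict(float)              # eligibility traces z(s, a)
        s = env.reset()
        a = epsilon_greedy(q, s, env.actions, epsilon)
        done = False
        while not done:                     # repeat (for each step of episode)
            r, s_next, done = env.step(a)   # take action a, observe r, s'
            a_next = epsilon_greedy(q, s_next, env.actions, epsilon)
            # delta := r + gamma*q(s', a') - q(s, a); q(terminal, .) is 0
            delta = r + (0.0 if done else gamma * q[(s_next, a_next)]) - q[(s, a)]
            z[(s, a)] += 1.0                # accumulating trace for (s, a)
            for key in list(z):             # for all s, a with nonzero trace
                q[key] += alpha * delta * z[key]
                z[key] *= gamma * lam       # decay all traces
            s, a = s_next, a_next
    return q

q = sarsa_lambda(ChainEnv())
print(sorted(q.items()))

Storing the traces in a dictionary means the inner loop only touches state-action pairs that have actually been visited, which is how the "for all s, a" sweep is usually realized in practice.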
3.10 Summary
In this chapter, we gave a short introduction to reinforcement learning. We have seen that RL addresses the problems from Chap. 2 and is thus, in principle, a suitable tool for solving all four of them.
Furthermore, in addition to online learning, we suggested offline learning by policy iteration for solving the Bellman equation. Both approaches are linked consistently via the action-value and state-value functions. We can, for instance, calculate the action-value function offline, using historical data, and then update it online. In this way, RL is also a very nice example of the link between offline and online learning.
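The following sketch illustrates this offline/online link. The logged transition format (s, a, r, s′, a′) and the function names are assumptions made here for illustration only, not the book's notation.

from collections import defaultdict

def offline_action_values(log, alpha=0.1, gamma=0.95, sweeps=20):
    # Estimate q(s, a) from historical transitions by repeated
    # Sarsa-style backups over the log (an offline evaluation pass).
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, a_next in log:
            q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] - q[(s, a)])
    return q

def online_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    # Continue refining the same table with live transitions.
    q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] - q[(s, a)])

# Example: warm-start from two logged transitions, then update online.
history = [(0, "right", 0.0, 1, "right"), (1, "right", 1.0, 2, "right")]
q = offline_action_values(history)
online_update(q, 0, "right", 0.0, 1, "right")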
We now turn to the application of reinforcement learning to recommendation engines.