Fig. 3.7 A sequence of an episode of 2 steps (states s_t, s_{t+1}, s_{t+2}; actions a_t, a_{t+1})
So while t indicates the step within the episode, that is, along the chain (s_1, a_1) → (s_2, a_2) → (s_3, a_3) → ..., k is the index of the update for a fixed pair (s_t, a_t) throughout all episodes. In order not to overburden the notation, we leave out the index k and in its place use the assignment symbol ":=".
Before we come to the explanation, the first question immediately arises: since, to carry out an update of the action value q(s_t, a_t) at step t in real time, we need the action value q(s_{t+1}, a_{t+1}) of the next step t+1, how is this supposed to work in practice? Doesn't this remind you of Baron Münchhausen, who escapes from the swamp by pulling himself up by the hair?
There are simple solutions to this: we can, for instance, wait until step t+1 and then perform the update (3.10), that is, always learn with a delay of one step. Or we may exploit the fact that we determine the actions ourselves via the policy (provided we are not learning from historical data): at step t, we already know our next action a_{t+1} and can thus work with the current q(s_{t+1}, a_{t+1}) (Fig. 3.7).
To continue with the explanation, α_t is the learning parameter at step t. The higher it is, the faster the algorithm learns. Thus, the current temporal difference d_t is
d_t(s_t, a_t, s_{t+1}, a_{t+1}) = r_{t+1} + γ q(s_{t+1}, a_{t+1}) − q(s_t, a_t)    (3.11)
and (3.10) takes the following form:

q(s_t, a_t) := q(s_t, a_t) + α_t d_t(s_t, a_t, s_{t+1}, a_{t+1}).    (3.12)
This means that we compute the new estimate

q̃(s_t, a_t) := r_{t+1} + γ q(s_{t+1}, a_{t+1})
and subtract the previous iterate q(s_t, a_t) therefrom. If q̃(s_t, a_t) is greater than q(s_t, a_t), then the latter is increased in accordance with (3.12); if q̃(s_t, a_t) is less than q(s_t, a_t), then the latter is decreased in accordance with (3.12).
So what does q̃(s_t, a_t) mean? We know that q(s_t, a_t) is the expected return taken across the remainder of the episode. The first term r_{t+1} is the direct reward of the recommendation a_t. The second term γ q(s_{t+1}, a_{t+1}) is the expected return from the new state s_{t+1}.
It follows that there are once again two possibilities for the reason why q̃(s_t, a_t) may be higher than q(s_t, a_t): either the direct reward r_{t+1} is high or the action a_t has led to a valuable state s_{t+1} with a high action value q(s_{t+1}, a_{t+1}) (or both).