Q-Learning uses tables to store its value estimates, an approach that quickly loses viability as the complexity of the system being monitored or controlled increases. One answer to this problem is to use an (adapted) Artificial Neural Network as a function approximator, as demonstrated by Tesauro in his backgammon-playing Temporal Difference Learning research. An adaptation of the standard neural network is required because the required result (from which the error signal is generated) is itself generated at run-time.
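As a rough illustration of this idea (a minimal sketch, not Tesauro's actual network), the fragment below replaces the Q-table with a small feed-forward network whose training target is built at run-time by bootstrapping from the network's own prediction of the next state's value. The state and action dimensions, hidden-layer size, learning rate and discount factor are illustrative assumptions.

```python
import numpy as np

class QNetwork:
    """A small neural network used in place of a Q-table (illustrative sketch)."""

    def __init__(self, n_state, n_action, n_hidden=32, lr=0.01, gamma=0.9):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (n_state, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_action))
        self.b2 = np.zeros(n_action)
        self.lr, self.gamma = lr, gamma

    def predict(self, s):
        # Forward pass: one tanh hidden layer, linear Q-value per action.
        self.h = np.tanh(np.asarray(s, dtype=float) @ self.W1 + self.b1)
        return self.h @ self.W2 + self.b2

    def update(self, s, a, r, s_next, done):
        # The training target does not exist in advance; it is generated at
        # run-time by bootstrapping: r + gamma * max_a' Q(s', a').
        target = r if done else r + self.gamma * np.max(self.predict(s_next))
        q = self.predict(s)              # also caches hidden activations for s
        error = target - q[a]
        # Backpropagate the error through the chosen action's output only.
        grad_out = np.zeros_like(q)
        grad_out[a] = -error
        grad_h = (grad_out @ self.W2.T) * (1.0 - self.h ** 2)
        self.W2 -= self.lr * np.outer(self.h, grad_out)
        self.b2 -= self.lr * grad_out
        self.W1 -= self.lr * np.outer(np.asarray(s, dtype=float), grad_h)
        self.b1 -= self.lr * grad_h
        return error
```

Updating the weights after every transition in this way takes the place of writing a new Q-value into a table cell.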
Monte Carlo methods perform a backup for each state based on the entire sequence of observed rewards from that state until the end of the episode. The backup of Q-learning, on the other hand, is based on just the next reward, using the value of the state one step later as a proxy for the remaining rewards (bootstrapping). Reinforcement learning therefore needs repeated learning to reach optimal policies. We construct a λ-reward function R_t^λ by rewriting (10.8) as shown in Equation (10.16). If the system reaches the end state at step T, the value function conforms to Equation (10.17). The theoretical meaning of the λ-reward function is illustrated in Fig. 10.8.
R_t^{\lambda} = r_{t+1} + \lambda r_{t+2} + \lambda^2 r_{t+3} + \cdots + \lambda^{T-1} r_{t+T}        (10.16)
V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t^{\lambda} - V(s_t) \right]        (10.17)
Fig. 10.8. λ-reward function
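As a hedged numerical illustration of Equations (10.16) and (10.17), the sketch below computes the λ-reward over the rewards observed from s_t to the end of the episode and then applies the value update. The reward sequence and the values λ = 0.9, α = 0.1 are illustrative assumptions, not taken from the text.

```python
def lambda_reward(rewards, lam):
    # Equation (10.16): R_t = r_{t+1} + lam*r_{t+2} + ... + lam**(T-1)*r_{t+T}
    return sum(lam ** k * r for k, r in enumerate(rewards))

def value_update(v_s, R_t, alpha):
    # Equation (10.17): V(s_t) <- V(s_t) + alpha * (R_t - V(s_t))
    return v_s + alpha * (R_t - v_s)

rewards = [1.0, 0.0, 0.0, 5.0]                        # assumed r_{t+1} ... r_{t+4}
R_t = lambda_reward(rewards, lam=0.9)                 # 1 + 0 + 0 + 0.9**3 * 5 = 4.645
V_new = value_update(v_s=2.0, R_t=R_t, alpha=0.1)     # 2.0 + 0.1*(4.645 - 2.0) = 2.2645
print(R_t, V_new)
```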
The TD(λ) algorithm can be understood as one particular way of averaging n-step backups. According to Equation (10.17), the TD(λ) algorithm can be designed.
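A common way to realise this averaging of n-step backups is the backward view with eligibility traces; the sketch below is one such tabular implementation, not necessarily the construction the text goes on to give. The table size, episode format and the parameters α, γ, λ are illustrative assumptions.

```python
import numpy as np

def td_lambda_episode(episode, V, alpha=0.1, gamma=1.0, lam=0.9):
    """Tabular TD(lambda), backward view. `episode` is a list of
    (s, r, s_next, done) transitions and `V` is a value table."""
    e = np.zeros_like(V)                      # eligibility traces
    for s, r, s_next, done in episode:
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]                 # one-step TD error
        e *= gamma * lam                      # decay every trace
        e[s] += 1.0                           # accumulate trace for the visited state
        V += alpha * delta * e                # credit recently visited states
    return V

V = np.zeros(4)
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True)]
V = td_lambda_episode(episode, V)
```

With λ close to 1 every state visited in the episode receives essentially the full Monte Carlo backup, while λ = 0 reduces the update to the one-step bootstrapping backup described above.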
 