method and genetic simulated annealing. The selection probability of each action
is related to its Q value:
$$p(a \mid s) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}} \qquad (10.13)$$
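As a rough illustration of this Boltzmann (softmax) selection rule, the following Python sketch turns a table of Q values into selection probabilities at temperature T and samples an action from them; the action names, the Q values, and the helper softmax_selection are hypothetical stand-ins, not part of the original text:

import random
from math import exp

def softmax_selection(q_values, temperature):
    # Boltzmann selection: p(a|s) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T).
    # Subtracting the maximum Q value improves numerical stability
    # without changing the resulting probabilities.
    max_q = max(q_values.values())
    weights = {a: exp((q - max_q) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    probabilities = {a: w / total for a, w in weights.items()}
    # Sample one action according to the computed probabilities.
    actions, probs = zip(*probabilities.items())
    return random.choices(actions, weights=probs, k=1)[0], probabilities

# Hypothetical Q values for three actions in some state s.
q_s = {"left": 1.0, "right": 2.0, "stay": 0.5}
action, p = softmax_selection(q_s, temperature=0.5)
print(p)       # higher-valued actions get exponentially larger probability
print(action)  # sampled action

A small temperature T makes the selection nearly greedy with respect to Q, while a large T makes it nearly uniform, which is the annealing behaviour alluded to above.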
10.5 Temporal-Difference Learning
Temporal-difference (TD) learning is a combination of Monte Carlo ideas and
dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can
learn directly from raw experience without a model of the environment's
dynamics. TD resembles a Monte Carlo method because it learns by sampling the
environment according to some policy. TD is related to dynamic programming
techniques because it approximates its current estimate based on previously
learned estimates (a process known as bootstrapping). The TD learning algorithm is
related to the temporal-difference model of animal learning.
As a prediction method, TD learning takes into account the fact that
subsequent predictions are often correlated in some sense. In standard supervised
predictive learning, one only learns from actually observed values: a prediction is
made, and when the observation is available, the prediction is adjusted to better
match the observation. TD(0) is the simplest case of temporal-difference learning
and is described as follows.
Algorithm 10.1 TD(0) learning algorithm.
Initialize V(s) arbitrarily, and π to the policy to be evaluated
Repeat (for each episode):
  Initialize s
  Repeat (for each step of the episode):
    Choose a from s using the policy π derived from V (e.g., ε-greedy)
    Take action a, observe r, s'
    V(s) ← V(s) + α[r + γV(s') − V(s)]
    s ← s'
  Until s is terminal
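As a concrete reading of Algorithm 10.1, the following is a minimal tabular TD(0) sketch in Python; the RandomWalk toy environment, its reset()/step() interface, and the dummy policy are illustrative assumptions rather than part of the original algorithm:

import random
from collections import defaultdict

class RandomWalk:
    # Toy 5-state random-walk environment (illustrative assumption, not from the text).
    def reset(self):
        self.s = 2                        # start in the middle state
        return self.s
    def step(self, a):                    # the action is ignored; moves are random left/right
        self.s += random.choice([-1, 1])
        done = self.s in (0, 4)
        r = 1.0 if self.s == 4 else 0.0
        return self.s, r, done

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=1.0):
    # Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].
    V = defaultdict(float)                # Initialize V(s) arbitrarily (here: 0)
    for _ in range(episodes):
        s = env.reset()                   # Initialize s
        done = False
        while not done:                   # Repeat for each step of the episode
            a = policy(s)                 # Choose a from s using the policy to be evaluated
            s_next, r, done = env.step(a) # Take action a, observe r, s'
            # TD(0) update: bootstrap from the current estimate of V(s')
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next                    # s <- s'
    return V

V = td0_evaluate(RandomWalk(), policy=lambda s: None)
print({s: round(v, 2) for s, v in sorted(V.items())})

Because the target r + γV(s') uses the current estimate of V(s'), each update bootstraps from previously learned values, which is the TD property described above.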
The TD(0) learning algorithm contains two steps: determine the new action policy
according to the current value function, and evaluate the action policy by the