method and genetic simulated annealing. The selection probability of each action
is related to its Q value:
$$p(a \mid s) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}} \qquad (10.13)$$
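As a rough illustration of this Boltzmann (softmax) selection rule, the following Python sketch turns a table of Q values into selection probabilities at temperature T and samples an action from them; the action names, the Q values, and the helper softmax_selection are hypothetical stand-ins, not part of the original text:

import random
from math import exp

def softmax_selection(q_values, temperature):
    # Boltzmann selection: p(a|s) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T).
    # Subtracting the maximum Q value improves numerical stability
    # without changing the resulting probabilities.
    max_q = max(q_values.values())
    weights = {a: exp((q - max_q) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    probabilities = {a: w / total for a, w in weights.items()}
    # Sample one action according to the computed probabilities.
    actions, probs = zip(*probabilities.items())
    return random.choices(actions, weights=probs, k=1)[0], probabilities

# Hypothetical Q values for three actions in some state s.
q_s = {"left": 1.0, "right": 2.0, "stay": 0.5}
action, p = softmax_selection(q_s, temperature=0.5)
print(p)       # higher-valued actions get exponentially larger probability
print(action)  # sampled action

A small temperature T makes the selection nearly greedy with respect to Q, while a large T makes it nearly uniform, which is the annealing behaviour alluded to above.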
10.5 Temporal-Difference Learning
Temporal-difference (TD) learning is a combination of Monte Carlo ideas and
dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can
learn directly from raw experience without a model of the environment's
dynamics. TD resembles a Monte Carlo method because it learns by sampling the
environment according to some policy. TD is related to dynamic programming
techniques because it approximates its current estimate based on previously
learned estimates (a process known as bootstrapping). The TD learning algorithm is
related to the temporal-difference model of animal learning.
As a prediction method, TD learning takes into account the fact that
subsequent predictions are often correlated in some sense. In standard supervised
predictive learning, one only learns from actually observed values: a prediction is
made, and when the observation is available, the prediction is adjusted to better
match the observation. TD(0) is the simplest case of temporal-difference learning
and is described as follows.
Algorithm 10.1 TD(0) learning algorithm.
Initialize V(s) arbitrarily, and π to the policy to be evaluated
Repeat (for each episode):
  Initialize s
  Repeat (for each step of the episode):
    Choose a from s using the policy π derived from V (e.g., ε-greedy)
    Take action a, observe r, s'
    V(s) ← V(s) + α[r + γV(s') − V(s)]
    s ← s'
  Until s is terminal
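As a concrete reading of Algorithm 10.1, the following is a minimal tabular TD(0) sketch in Python; the RandomWalk toy environment, its reset()/step() interface, and the dummy policy are illustrative assumptions rather than part of the original algorithm:

import random
from collections import defaultdict

class RandomWalk:
    # Toy 5-state random-walk environment (illustrative assumption, not from the text).
    def reset(self):
        self.s = 2                        # start in the middle state
        return self.s
    def step(self, a):                    # the action is ignored; moves are random left/right
        self.s += random.choice([-1, 1])
        done = self.s in (0, 4)
        r = 1.0 if self.s == 4 else 0.0
        return self.s, r, done

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=1.0):
    # Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].
    V = defaultdict(float)                # Initialize V(s) arbitrarily (here: 0)
    for _ in range(episodes):
        s = env.reset()                   # Initialize s
        done = False
        while not done:                   # Repeat for each step of the episode
            a = policy(s)                 # Choose a from s using the policy to be evaluated
            s_next, r, done = env.step(a) # Take action a, observe r, s'
            # TD(0) update: bootstrap from the current estimate of V(s')
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next                    # s <- s'
    return V

V = td0_evaluate(RandomWalk(), policy=lambda s: None)
print({s: round(v, 2) for s, v in sorted(V.items())})

Because the target r + γV(s') uses the current estimate of V(s'), each update bootstraps from previously learned values, which is the TD property described above.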
The TD(0) learning algorithm contains two steps: determine the new action policy
according to the current value function, and evaluate the action policy by the