In TD(λ), the value function is updated by Equation (10.17) through the eligibility trace e(s). A complete algorithm for on-line TD(λ) is given in Algorithm 10.2.
Algorithm 10.2 TD(λ) Algorithm.
Initialize V(s) arbitrarily and e(s) = 0 for all s ∈ S
Repeat (for each episode)
    Initialize s
    Repeat (for each step of episode)
        a ← action given by π for s (e.g., ε-greedy)
        Take action a, observe r, s′
        δ ← r + γV(s′) − V(s)
        e(s) ← e(s) + 1
        for all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s′
    Until s is terminal
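To make the listing concrete, here is a minimal Python sketch of tabular on-line TD(λ) with accumulating eligibility traces. The environment interface (reset() and step() returning the next state, reward, and a terminal flag), the policy function, and the parameter values are illustrative assumptions; only the update rules are taken from Algorithm 10.2.

    import numpy as np

    def td_lambda(env, policy, n_states, num_episodes,
                  alpha=0.1, gamma=0.99, lam=0.9):
        # Tabular on-line TD(lambda) with accumulating traces.
        # Assumed interface: env.reset() -> s, env.step(a) -> (s', r, done);
        # policy(s) -> a (e.g., epsilon-greedy). These details are illustrative.
        V = np.zeros(n_states)                # Initialize V(s) arbitrarily (here: zeros)
        for _ in range(num_episodes):
            e = np.zeros(n_states)            # e(s) = 0 for all s at the start of each episode
            s = env.reset()
            done = False
            while not done:                   # until s is terminal
                a = policy(s)
                s_next, r, done = env.step(a)             # take action a, observe r, s'
                # TD error; the value of a terminal successor is treated as 0
                delta = r + gamma * V[s_next] * (not done) - V[s]
                e[s] += 1.0                   # accumulating trace for the visited state
                V += alpha * delta * e        # update all states in proportion to e(s)
                e *= gamma * lam              # decay all traces
                s = s_next
        return V

Because the traces are stored as a vector, the "for all s" loop of Algorithm 10.2 becomes a single vectorized update: one TD error adjusts every recently visited state in proportion to its eligibility.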
We can combine the estimation and evaluation steps of the value function to construct a value function over state-action pairs, the Q function. In Q-learning, the learned action-value function Q directly approximates Q*, the optimal action-value function, independent of the policy being followed. The policy still has an effect in that it determines which state-action pairs are visited and updated. However, all that is required for correct convergence is that all pairs continue to be updated. This is a minimal requirement in the sense that any method guaranteed to find optimal behavior in the general case must require it. Under this assumption and a variant of the usual stochastic approximation conditions on the sequence of step-size parameters, Q_t has been shown to converge with probability 1 to Q*. The Q-learning algorithm is shown in procedural form in Algorithm 10.3.
Algorithm 10.3 Q-Learning Algorithm
Initialize Q(s, a) arbitrarily
Repeat (for each episode)
    Initialize s
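As with TD(λ), a minimal Python sketch of tabular Q-learning may help. The environment interface, the ε-greedy exploration scheme, and the parameter values are assumptions for illustration; the update Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)] is the standard Q-learning rule.

    import numpy as np

    def q_learning(env, n_states, n_actions, num_episodes,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
        # Tabular Q-learning with an epsilon-greedy behavior policy.
        # Assumed interface: env.reset() -> s, env.step(a) -> (s', r, done).
        # These interface details and parameter values are illustrative only.
        Q = np.zeros((n_states, n_actions))   # Initialize Q(s, a) arbitrarily (here: zeros)
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:                   # until s is terminal
                # Behavior policy: epsilon-greedy with respect to the current Q
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)             # take action a, observe r, s'
                # Off-policy target uses the greedy value of s', regardless of
                # which action the behavior policy will actually take next
                target = r + gamma * np.max(Q[s_next]) * (not done)
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q

Note that the behavior policy only determines which (s, a) pairs are visited; the learning target always uses the greedy value max_a′ Q(s′, a′), which is what makes Q-learning off-policy and lets Q approximate Q* regardless of the exploration strategy used.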