Q-Learning uses tables to store its value estimates, an approach that quickly loses viability as the complexity of the system being monitored or controlled increases. One answer to this problem is to use an (adapted) Artificial Neural Network as a function approximator, as demonstrated by Tesauro in his backgammon-playing Temporal Difference Learning research. An adaptation of the standard neural network is required because the required result (from which the error signal is generated) is itself generated at run-time.
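As a rough illustration of this idea (a minimal sketch, not Tesauro's actual network), the fragment below replaces the Q-table with a small feed-forward network whose training target is built at run-time by bootstrapping from the network's own prediction of the next state's value. The state and action dimensions, hidden-layer size, learning rate and discount factor are illustrative assumptions.

```python
import numpy as np

class QNetwork:
    """A small neural network used in place of a Q-table (illustrative sketch)."""

    def __init__(self, n_state, n_action, n_hidden=32, lr=0.01, gamma=0.9):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (n_state, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_action))
        self.b2 = np.zeros(n_action)
        self.lr, self.gamma = lr, gamma

    def predict(self, s):
        # Forward pass: one tanh hidden layer, linear Q-value per action.
        self.h = np.tanh(np.asarray(s, dtype=float) @ self.W1 + self.b1)
        return self.h @ self.W2 + self.b2

    def update(self, s, a, r, s_next, done):
        # The training target does not exist in advance; it is generated at
        # run-time by bootstrapping: r + gamma * max_a' Q(s', a').
        target = r if done else r + self.gamma * np.max(self.predict(s_next))
        q = self.predict(s)              # also caches hidden activations for s
        error = target - q[a]
        # Backpropagate the error through the chosen action's output only.
        grad_out = np.zeros_like(q)
        grad_out[a] = -error
        grad_h = (grad_out @ self.W2.T) * (1.0 - self.h ** 2)
        self.W2 -= self.lr * np.outer(self.h, grad_out)
        self.b2 -= self.lr * grad_out
        self.W1 -= self.lr * np.outer(np.asarray(s, dtype=float), grad_h)
        self.b1 -= self.lr * grad_h
        return error
```

Updating the weights after every transition in this way takes the place of writing a new Q-value into a table cell.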
Monte Carlo methods perform a backup for each state based on the entire sequence of observed rewards from that state until the end of the episode. The backup of Q-learning, on the other hand, is based on just the next reward, using the value of the state one step later as a proxy for the remaining rewards (bootstrapping). Reinforcement learning therefore needs repeated learning to reach optimal policies. We construct a λ-reward function R_t^λ by rewriting (10.8) as shown in Equation (10.16). If the system reaches the end state at step T, the value function conforms to Equation (10.17). The theoretical meaning of the λ-reward function is illustrated in Fig. 10.8.
R_t^{\lambda} = r_{t+1} + \lambda r_{t+2} + \lambda^2 r_{t+3} + \cdots + \lambda^{T-1} r_{t+T}        (10.16)
V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t^{\lambda} - V(s_t) \right]        (10.17)
Fig. 10.8. λ-reward function
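As a hedged numerical illustration of Equations (10.16) and (10.17), the sketch below computes the λ-reward over the rewards observed from s_t to the end of the episode and then applies the value update. The reward sequence and the values λ = 0.9, α = 0.1 are illustrative assumptions, not taken from the text.

```python
def lambda_reward(rewards, lam):
    # Equation (10.16): R_t = r_{t+1} + lam*r_{t+2} + ... + lam**(T-1)*r_{t+T}
    return sum(lam ** k * r for k, r in enumerate(rewards))

def value_update(v_s, R_t, alpha):
    # Equation (10.17): V(s_t) <- V(s_t) + alpha * (R_t - V(s_t))
    return v_s + alpha * (R_t - v_s)

rewards = [1.0, 0.0, 0.0, 5.0]                        # assumed r_{t+1} ... r_{t+4}
R_t = lambda_reward(rewards, lam=0.9)                 # 1 + 0 + 0 + 0.9**3 * 5 = 4.645
V_new = value_update(v_s=2.0, R_t=R_t, alpha=0.1)     # 2.0 + 0.1*(4.645 - 2.0) = 2.2645
print(R_t, V_new)
```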
The TD(λ) algorithm can be understood as one particular way of averaging n-step backups. According to Equation (10.17), the TD(λ) algorithm can be designed.
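A common way to realise this averaging of n-step backups is the backward view with eligibility traces; the sketch below is one such tabular implementation, not necessarily the construction the text goes on to give. The table size, episode format and the parameters α, γ, λ are illustrative assumptions.

```python
import numpy as np

def td_lambda_episode(episode, V, alpha=0.1, gamma=1.0, lam=0.9):
    """Tabular TD(lambda), backward view. `episode` is a list of
    (s, r, s_next, done) transitions and `V` is a value table."""
    e = np.zeros_like(V)                      # eligibility traces
    for s, r, s_next, done in episode:
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]                 # one-step TD error
        e *= gamma * lam                      # decay every trace
        e[s] += 1.0                           # accumulate trace for the visited state
        V += alpha * delta * e                # credit recently visited states
    return V

V = np.zeros(4)
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True)]
V = td_lambda_episode(episode, V)
```

With λ close to 1 every state visited in the episode receives essentially the full Monte Carlo backup, while λ = 0 reduces the update to the one-step bootstrapping backup described above.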
 