n > 1 and could extend across multiple episodes. After adding the momentum term,
Eqs. 2.7 and 2.8 become, respectively:
$$
\Delta w_{jh}(t) = \eta\,\Delta w_{jh}(t-1) + \alpha\left[r(t+1) + \gamma V(t+1) - V(t)\right]\sum_{k=0}^{t}\lambda^{t-k} f'\!\big(v_j(k)\big)\, y_h(k) \qquad (2.9)
$$

$$
\Delta w_{hi}(t) = \eta\,\Delta w_{hi}(t-1) + \alpha\left[r(t+1) + \gamma V(t+1) - V(t)\right]\sum_{k=0}^{t}\lambda^{t-k} f'\!\big(v_j(k)\big)\, w_{jh}(t)\, f'\!\big(v_h(k)\big)\, x_i(k) \qquad (2.10)
$$
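As a concrete illustration, the following Python sketch applies Eqs. 2.9 and 2.10 to a network with a single hidden layer and one output unit. The class name, the sigmoid choice for f, the parameter values, and the incremental (running-sum) handling of the summation over k are assumptions made for this example, not code from the text.

```python
# Minimal sketch of the momentum-augmented TD(lambda) updates in Eqs. 2.9 and 2.10
# for a single-hidden-layer network with one output unit V. All names and defaults
# here are illustrative assumptions.
import numpy as np

def f(v):        # activation function (assumed sigmoid)
    return 1.0 / (1.0 + np.exp(-v))

def f_prime(v):  # derivative of the activation function
    s = f(v)
    return s * (1.0 - s)

class TDNetwork:
    def __init__(self, n_in, n_hidden, alpha=0.1, gamma=1.0, lam=0.7, eta=0.5):
        self.alpha, self.gamma, self.lam, self.eta = alpha, gamma, lam, eta
        self.w_hi = np.random.uniform(-0.1, 0.1, (n_hidden, n_in))  # input -> hidden
        self.w_jh = np.random.uniform(-0.1, 0.1, n_hidden)          # hidden -> output
        # Running sums over k of lambda^(t-k) * gradient terms (the summations in 2.9, 2.10)
        self.e_jh = np.zeros_like(self.w_jh)
        self.e_hi = np.zeros_like(self.w_hi)
        # Previous weight changes, used by the momentum terms
        self.d_jh_prev = np.zeros_like(self.w_jh)
        self.d_hi_prev = np.zeros_like(self.w_hi)

    def value(self, x):
        v_h = self.w_hi @ x      # net input to hidden units
        y_h = f(v_h)             # hidden activations
        v_j = self.w_jh @ y_h    # net input to the output unit
        return f(v_j), v_h, y_h, v_j

    def update(self, x_t, r_next, v_next):
        """One weight update after observing state x(t), reward r(t+1), and V(t+1)."""
        V_t, v_h, y_h, v_j = self.value(x_t)
        delta = r_next + self.gamma * v_next - V_t   # temporal difference error

        # Decay old gradient information by lambda and add the gradient at x(t)
        self.e_jh = self.lam * self.e_jh + f_prime(v_j) * y_h
        self.e_hi = self.lam * self.e_hi + np.outer(
            f_prime(v_j) * self.w_jh * f_prime(v_h), x_t)

        # Eq. 2.9: hidden-to-output weights, momentum term plus learning term
        d_jh = self.eta * self.d_jh_prev + self.alpha * delta * self.e_jh
        # Eq. 2.10: input-to-hidden weights
        d_hi = self.eta * self.d_hi_prev + self.alpha * delta * self.e_hi

        self.w_jh += d_jh
        self.w_hi += d_hi
        self.d_jh_prev, self.d_hi_prev = d_jh, d_hi
```

In use, update() would be called once per transition with the observed x(t), the reward r(t+1), and the network's own estimate of V(t+1); depending on the problem, the running sums and previous weight changes may be reset at the start of each episode or, as noted above, carried across multiple episodes.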
There are a few important notes about the above two equations. The error term,
the difference between subsequent state values V(t+1) and V(t) (neglecting the reward
term for now), is essentially the information (i.e., feedback) that is used to update
network weights. Drawing from the terminology of the back-propagation algorithm,
V(t+1) can be considered the target output value y_j, and V(t) can be considered the
predicted output value ŷ_j. The network error is therefore based on the next-state value
V(t+1), despite the fact that this value is merely an estimate and may actually be
quite different from the true state value. When weight updates become small, it is
possible that numerical errors could affect the final network weights. However, it is
likely that the error between the approximated value function by the neural network
and the true (unobserved) value function will dominate any numerical error, and thus
numerical error is not a large concern.
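Stated in back-propagation terms, and with the reward term retained rather than neglected, one way to read the bracketed error in Eqs. 2.9 and 2.10 is as a target minus a prediction:

$$
\delta(t) \;=\; \underbrace{r(t+1) + \gamma V(t+1)}_{\text{target } y_j} \;-\; \underbrace{V(t)}_{\text{prediction } \hat{y}_j}
$$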
The general form of a sequential decision-making process proceeds over t =
0, 1, ..., T. For all intermediate (i.e., non-terminal, t ≠ T) time steps, the next-state
values V(t+1) are available based on their predicted value after pursuing an action;
an associated reward r(t+1) may also be provided at these time steps, and
the incorporation of rewards at intermediate time points is dependent on the specific
problem. For example, some problems have well-defined subgoals that must be
achieved en route to an ultimate goal. In such cases, non-zero reward values may be
provided when these subgoals are achieved. In other problems, such as board games,
there are often no such subgoals, and thus the reward at all intermediate time steps
t = 0, 1, ..., T − 1 is 0, and the temporal difference error is then based only on the
difference between subsequent state values (i.e., γV(t+1) − V(t)).
At the terminal time step t = T, there is no next-state value V(t+1), and this value
is set to 0. The reward at this time step, r(t+1), is non-zero, however, and the temporal
difference error is then based on the previous state value and the reward value (i.e.,
r(t+1) − V(t)). For problems in which there are no intermediate rewards, the reward
provided at time step T is the only information that is known with complete certainty,
and thus this is the only true information from which the neural network can learn
the values of all previously visited states.
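A small sketch, assuming an illustrative helper named td_error and a discount factor gamma (neither taken from the text), makes the intermediate and terminal cases above explicit:

```python
# Illustrative helper (not from the book) showing how the temporal difference
# error changes between intermediate and terminal time steps.
def td_error(r_next, v_next, v_t, terminal, gamma=1.0):
    if terminal:
        v_next = 0.0                  # no next-state value at t = T
    return r_next + gamma * v_next - v_t

# Intermediate step of a board-game-like problem: reward is 0, so the error is
# gamma * V(t+1) - V(t).
print(td_error(r_next=0.0, v_next=0.62, v_t=0.55, terminal=False))

# Terminal step t = T: V(t+1) is set to 0, so the error is r(t+1) - V(t).
print(td_error(r_next=1.0, v_next=0.0, v_t=0.80, terminal=True))
```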
Equations 2.9 and 2.10 require that information (terms within the summation)
from all previous states x(0), ..., x(t − 1) be used to determine the appropriate weight
adjustment at state x(t). At time step t, information from all previous states is merely
discounted by λ. This can be exploited to reduce the number of computations required