state values at the current and subsequent time steps, respectively. More explicitly, we can write x^(t+1) = π(x^(t)), which indicates that the state x^(t+1) is a function of the ε-greedy action selection policy π, as is used in this work. More generally though, any action selection policy π may be used, in which case we could simply write x^(t+1) = π(x^(t)). An ε-greedy action selection policy chooses the action that results in the next-state x^(t+1) with the greatest value 100ε % of the time (where ε ranges over [0, 1]). In the other 100(1 − ε) % of the time, random actions are chosen regardless of the values of the next-states. Thus, this policy allows the agent to exploit its knowledge 100ε % of the time, but also to explore potentially better actions 100(1 − ε) % of the time.
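As a concrete illustration, the following is a minimal sketch of this selection rule, following the convention above in which ε is the probability of the greedy (exploiting) choice; the candidate next-states and the value_of function that forward-propagates a state through the network are hypothetical placeholders:

```python
import random

def epsilon_greedy(next_states, value_of, epsilon):
    """Select a next-state under the epsilon-greedy convention used here:
    with probability epsilon, exploit by taking the next-state with the
    greatest predicted value; otherwise explore by choosing uniformly at
    random. `next_states` and `value_of` are hypothetical placeholders for
    the states reachable from x(t) and the network's value estimate."""
    if random.random() < epsilon:
        return max(next_states, key=value_of)  # exploit: greedy choice
    return random.choice(next_states)          # explore: random choice
```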
The values of V^(t) and V^(t+1) are determined by evaluating the respective states (x^(t) and x^(t+1)) through the neural network using forward propagation, where the next-state vector x^(t+1) is determined based on an action selection procedure and the dynamics of the domain. This expression for the temporal difference error also discounts the subsequent state value V^(t+1) by a factor γ, which serves to attenuate the value that the network is attempting to learn. Note that the temporal difference algorithm gets its name from this error expression, as it is based on the difference in the predicted state values at two different time steps.
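For concreteness, the temporal difference error described here can be sketched as follows, assuming a hypothetical value_network callable that returns the forward-propagated value of a state vector:

```python
def td_error(value_network, x_t, x_t1, reward_t1, gamma):
    """Temporal difference error r(t+1) + gamma * V(t+1) - V(t), where V(t)
    and V(t+1) are obtained by forward propagation of x(t) and x(t+1)."""
    V_t = value_network(x_t)    # V(t)
    V_t1 = value_network(x_t1)  # V(t+1), with x(t+1) chosen by the action selection policy
    return reward_t1 + gamma * V_t1 - V_t
```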
The general form of the TD(λ) algorithm can be more explicitly written for updating the network weights such that w ← w + Δw. The weight updates between nodes in the output layer j and nodes in the hidden layer h (Δw_jh^(t)) at time step t can be stated as:

$$
\Delta w_{jh}^{(t)} = \alpha \left( r^{(t+1)} + \gamma V^{(t+1)} - V^{(t)} \right) \sum_{k=0}^{t} \lambda^{t-k} \, f'\!\left( v_j^{(k)} \right) y_h^{(k)}
\tag{2.7}
$$
where f'(v_j^(k)) is the derivative of the transfer function at node j evaluated at the induced local field v_j^(k) at time step k. Equation 2.7 can then be extended to updating the weights between nodes in the hidden layer h and nodes in the input layer i (Δw_hi^(t)) at time step t as (again with a single output node):

$$
\Delta w_{hi}^{(t)} = \alpha \left( r^{(t+1)} + \gamma V^{(t+1)} - V^{(t)} \right) \sum_{k=0}^{t} \lambda^{t-k} \, f'\!\left( v_j^{(k)} \right) w_{jh}^{(t)} \, f'\!\left( v_h^{(k)} \right) x_i^{(k)}
\tag{2.8}
$$
A basic implementation of the TD(λ) algorithm requires only the use of Eqs. 2.7 and 2.8. Extending these equations using some relatively simple techniques, however, can significantly reduce the computational cost in terms of both time and space, improving the efficiency of the learning algorithm.
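As a rough sketch of such an implementation, the Python class below applies Eqs. 2.7 and 2.8 for a network with a single tanh hidden layer and a single output node. Rather than storing the full history of past gradients, the λ-weighted sums over k are accumulated incrementally as running traces (e_jh, e_hi), which is algebraically equivalent; the class name, layer sizes, transfer function, and parameter values are illustrative assumptions, not the specific implementation used in this work:

```python
import numpy as np

class TDLambdaNet:
    """Sketch of a single-hidden-layer value network updated per Eqs. 2.7/2.8."""

    def __init__(self, n_inputs, n_hidden, alpha=0.01, gamma=0.9, lam=0.7):
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.w_hi = np.random.uniform(-0.1, 0.1, (n_hidden, n_inputs))  # input -> hidden
        self.w_jh = np.random.uniform(-0.1, 0.1, n_hidden)              # hidden -> output
        self.e_hi = np.zeros_like(self.w_hi)  # running lambda-weighted sum for Eq. 2.8
        self.e_jh = np.zeros_like(self.w_jh)  # running lambda-weighted sum for Eq. 2.7

    def forward(self, x):
        v_h = self.w_hi @ x    # induced local fields at hidden nodes h
        y_h = np.tanh(v_h)     # hidden activations y_h
        v_j = self.w_jh @ y_h  # induced local field at output node j
        return np.tanh(v_j), v_j, y_h, v_h

    def update(self, x_t, reward_t1, V_t1):
        """One TD(lambda) step: accumulate the gradient terms at x(t), then
        apply the weight changes scaled by r(t+1) + gamma*V(t+1) - V(t)."""
        V_t, v_j, y_h, v_h = self.forward(x_t)
        dphi = lambda v: 1.0 - np.tanh(v) ** 2  # derivative of the tanh transfer function
        self.e_jh = self.lam * self.e_jh + dphi(v_j) * y_h                                    # Eq. 2.7 terms
        self.e_hi = self.lam * self.e_hi + np.outer(dphi(v_j) * self.w_jh * dphi(v_h), x_t)   # Eq. 2.8 terms
        delta = reward_t1 + self.gamma * V_t1 - V_t  # temporal difference error
        self.w_jh += self.alpha * delta * self.e_jh
        self.w_hi += self.alpha * delta * self.e_hi
```

A training step at time t would first select x(t+1) with the action selection policy, forward-propagate it to obtain V(t+1), and then call update(x_t, reward_t1, V_t1).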
A momentum term with coefficient η can be added to Eqs. 2.7 and 2.8 in order to incorporate a portion of the weight update from the previous time step t − 1 into that for the current time step t. This has the effect of smoothing out the network weight changes between time steps and is often most effective when training the network in a batch manner. Batch training is where weight updates are computed during every time step, but the updates are only applied after every n time steps, where