pattern i. The generalization of expression (5.9) to a p-dimensional output (the dimension of e, i.e., the number of network outputs) is given by
\hat{f}(\mathbf{0};\mathbf{w}) = \frac{1}{n h^{p}} \sum_{i=1}^{n} K\!\left(\frac{\mathbf{0}-\mathbf{e}(i)}{h}\right),    (6.16)
where h represents the bandwidth of the kernel K . Using the Gaussian kernel
with zero mean and unit covariance in this expression, the estimator for the
error density becomes
\hat{f}(\mathbf{0};\mathbf{w}) = \frac{1}{(2\pi)^{p/2}\, n h^{p}} \sum_{i=1}^{n} \exp\!\left(-\frac{\mathbf{e}(i)^{T}\mathbf{e}(i)}{2h^{2}}\right).    (6.17)
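To make the estimator concrete, here is a minimal numerical sketch of (6.17) in Python; the function name, and the convention that errors holds one p-dimensional error vector e(i) per row, are assumptions for illustration:

import numpy as np

def zero_error_density(errors, h):
    """Gaussian KDE estimate of the error density at the origin, eq. (6.17).

    errors : (n, p) array with one p-dimensional error vector e(i) per row.
    h      : kernel bandwidth.
    """
    n, p = errors.shape
    sq_norms = np.sum(errors**2, axis=1)               # e(i)^T e(i) for each i
    kernel_sum = np.sum(np.exp(-sq_norms / (2 * h**2)))
    return kernel_sum / ((2 * np.pi) ** (p / 2) * n * h**p)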
For reasons related to the speed of convergence of ZED, discussed in [214] and Chap. 5, the following simplification of expression (6.17) shall be used instead (in line with the EXP risk):
\hat{f}(\mathbf{0};\mathbf{w}) = h^{2} \sum_{i=1}^{n} \exp\!\left(-\frac{\mathbf{e}(i)^{T}\mathbf{e}(i)}{2h^{2}}\right);    (6.18)

given that the difference relies only on constant terms, the same extrema are found.
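As a quick check of that claim (reusing the hypothetical zero_error_density from the sketch above), the ratio of (6.18) to (6.17) is the constant (2π)^{p/2} n h^{p+2}, which does not depend on the errors and therefore not on w:

def zed_objective(errors, h):
    """Simplified ZED objective, eq. (6.18)."""
    sq_norms = np.sum(errors**2, axis=1)
    return h**2 * np.sum(np.exp(-sq_norms / (2 * h**2)))

rng = np.random.default_rng(0)
e = rng.normal(size=(100, 3))      # hypothetical errors: n = 100, p = 3
h = 0.5
ratio = zed_objective(e, h) / zero_error_density(e, h)
print(np.isclose(ratio, (2 * np.pi) ** (3 / 2) * 100 * h**(3 + 2)))   # True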
The gradient of (6.18) is

\frac{\partial \hat{f}(\mathbf{0};\mathbf{w})}{\partial \mathbf{w}} = -\sum_{i=1}^{n} \exp\!\left(-\frac{\mathbf{e}(i)^{T}\mathbf{e}(i)}{2h^{2}}\right) \mathbf{e}(i)^{T}\, \frac{\partial \mathbf{e}(i)}{\partial \mathbf{w}}.    (6.19)
Since the search is for the weights yielding the maximum of the error density
at 0, the network weight update shall be made by
\Delta \mathbf{w} = \eta\, \frac{\partial \hat{f}(\mathbf{0};\mathbf{w})}{\partial \mathbf{w}},    (6.20)
where η stands for the learning rate.
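As an illustration of update (6.20) with gradient (6.19), consider a minimal sketch for a hypothetical single linear layer y(i) = W x(i) with e(i) = t(i) − y(i), so that the Jacobian of each error component with respect to a weight row is −x(i) and (6.19) reduces to a weighted sum of outer products. A real network would obtain ∂e/∂w by backpropagation; all names below are illustrative:

import numpy as np

def zed_gradient_step(W, X, T, h, eta):
    """One gradient-ascent step on (6.18) via eqs. (6.19)-(6.20), for y = W x.

    W : (p, d) weights; X : (n, d) inputs; T : (n, p) targets.
    """
    E = T - X @ W.T                                   # errors e(i) = t(i) - y(i)
    k = np.exp(-np.sum(E**2, axis=1) / (2 * h**2))    # exp(-e^T e / 2h^2)
    # With e = t - W x, (6.19) becomes  df/dW = sum_i k(i) e(i) x(i)^T
    grad = (k[:, None] * E).T @ X
    return W + eta * grad                             # Delta w = eta df/dw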
Adapting ZED to the online training of the RNN changes expression (6.19) to
\frac{\partial \hat{f}(\mathbf{0},t;\mathbf{w})}{\partial \mathbf{w}} = -\sum_{i=1}^{L} \exp\!\left(-\frac{\mathbf{e}(t-i)^{T}\mathbf{e}(t-i)}{2h^{2}}\right) \mathbf{e}(t-i)^{T}\, \frac{\partial \mathbf{e}(t-i)}{\partial \mathbf{w}},    (6.21)
where, instead of computing the density over the n data instances of a training set, it is computed over the last L errors of the RNN. Notice that the time dependency of the gradient of the density is now explicit. This approach is an approximation to the real gradient of the density, since it uses error values from different time steps to build a single estimate of the error density. Given that learning is online and the weights are adjusted at each time step, the construction of a density from errors at different time steps is valid if L is small enough that the weights do not change significantly over those steps.
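A sketch of the corresponding online update for the same hypothetical linear model, with a buffer of the past L input/error pairs standing in for the RNN history (in a real RNN the Jacobians ∂e(t−i)/∂w would come from the recurrent learning algorithm, e.g. backpropagation through time):

from collections import deque
import numpy as np

class OnlineZED:
    """Online ZED ascent over the last L errors, eq. (6.21) (illustrative)."""

    def __init__(self, W, h, eta, L):
        self.W, self.h, self.eta = W, h, eta
        self.buffer = deque(maxlen=L)              # last L (x, e) pairs

    def step(self, x, t):
        e = t - self.W @ x                         # current error e(t)
        grad = np.zeros_like(self.W)
        for x_i, e_i in self.buffer:               # e(t - i), i = 1..L
            k = np.exp(-(e_i @ e_i) / (2 * self.h**2))
            grad += k * np.outer(e_i, x_i)         # k e(t-i) x(t-i)^T
        self.W = self.W + self.eta * grad          # update (6.20) with (6.21)
        self.buffer.append((x, e))                 # e(t) joins the window
        return e

Note that the buffered errors were computed with earlier weight values, which is precisely the approximation to the real gradient that the text describes.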
 