Hence, when having a $w_N$ that minimises $\| X_N w_N - y_N \|_{M_N}$, both the sequence of inputs and the output estimates are $M_N$-orthogonal to the estimation errors. In other words, the hyperplane spanned by the vectors $X_N$ and $X_N w_N$ is $M_N$-orthogonal to the vector of estimation errors $( X_N w_N - y_N )$, and therefore, the output estimate is an orthogonal projection onto this hyperplane with respect to $\langle \cdot , \cdot \rangle_{M_N}$. This conforms to the batch learning solution introduced in Sect. 5.2.1.
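This orthogonality is easy to check numerically. The following is a minimal sketch, assuming numpy, randomly generated data, and a diagonal matching matrix; all variable names are illustrative and not part of the text.

```python
import numpy as np

# Numeric check of the M_N-orthogonality property for a weighted
# least-squares fit (illustrative data, not from the text).
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))              # input matrix X_N
y = rng.normal(size=N)                   # output vector y_N
M = np.diag(rng.uniform(0.1, 1.0, N))    # diagonal matching matrix M_N

# w_N minimising ||X_N w - y_N||^2_{M_N}: solve X^T M X w = X^T M y
w = np.linalg.solve(X.T @ M @ X, X.T @ M @ y)
residual = X @ w - y                     # estimation errors X_N w_N - y_N

# Inputs and output estimates are M_N-orthogonal to the residual
print(np.allclose(X.T @ M @ residual, 0.0))      # <inputs, errors>_M = 0
print(np.isclose((X @ w) @ M @ residual, 0.0))   # <estimates, errors>_M = 0
```

Both checks succeed because the normal equations $X^T M ( X w_N - y ) = 0$ hold exactly at the minimum.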
5.3.2 Steepest Gradient Descent
Steepest gradient descent is a well-known method for function minimisation,
based on following the gradient of that function. Applied to (5.5), it can be used
to find the weight vector that minimises the squared error. However, it is only
applicable if all observations are known at once, which is not the case when
performing incremental learning. Nonetheless, it is discussed here as it gives
valuable insights into the stability and speed of convergence of other gradient-
based incremental learning methods that are described in later sections.
As for batch learning, let $X$, $y$, $M$ and $c$ be the input matrix, the output vector, the matching matrix, and the match count, respectively, given all $N$ observations.
Then, steepest gradient descent is defined by

$$ w_{n+1} = w_n - \gamma_{n+1} \frac{1}{2} \nabla_{w_n} \| X w_n - y \|_M^2 , \qquad (5.18) $$
starting at some arbitrary $w_0$, and hence generating a sequence of weight vectors $\{ w_0, w_1, \dots \}$ by performing small steps along the gradient of the squared error. Note that $n$ does in this case refer to the iteration number of the method rather than to the index of the observation, and $\gamma_n > 0$ is the step size in the $n$th iteration. Evaluating the gradient $\nabla_{w_n} \| X w_n - y \|_M^2$ with respect to $w_n$ results in the algorithm
$$ w_{n+1} = w_n - \gamma_{n+1} X^T M ( X w_n - y ) . \qquad (5.19) $$
With each step along the gradient, steepest gradient descent reduces the squared error. As the error function is convex and hence has a unique minimum, following its gradient leads to this minimum and thus solves (5.5).
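For concreteness, (5.19) translates directly into a few lines of code. The following is a minimal sketch, assuming numpy, a diagonal matching matrix $M = \mathrm{diag}(m)$, and a constant step size kept small enough for the iteration to remain stable; function and variable names are illustrative only.

```python
import numpy as np

def steepest_gradient_descent(X, y, m, n_iter=500):
    """Minimal sketch of the iteration in Eq. (5.19).

    X: (N, D) input matrix, y: (N,) output vector, m: (N,) matching
    values forming the diagonal of the matching matrix M.
    """
    M = np.diag(m)
    H = X.T @ M @ X
    # Constant step size, kept below 2 / lambda_max(H) so that the
    # iteration remains stable (lambda_max = 2-norm of the symmetric H).
    gamma = 1.0 / np.linalg.norm(H, 2)
    w = np.zeros(X.shape[1])                    # arbitrary w_0
    for _ in range(n_iter):
        w = w - gamma * X.T @ M @ (X @ w - y)   # Eq. (5.19)
    return w
```

As the number of iterations grows, the returned weight vector approaches the batch solution of (5.5) for the same $X$, $y$ and $M$.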
Stability Criteria
By definition, the step size $\gamma_n$ can change at each iteration. When it is kept constant, that is, $\gamma_n = \gamma$ for all $n > 0$, and the gradient is Lipschitz continuous$^1$, the steepest gradient descent method is guaranteed to converge to the minimum of (5.5), if that minimum exists [17, Prop. 3.4]. In our case the gradient, as a function of $w$, is Lipschitz continuous and hence convergence for a constant step size is guaranteed.
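To see why the gradient is Lipschitz continuous here, note that it is linear in $w$, so a finite constant can be read off directly:

$$ \nabla_w \| X w - y \|_M^2 = 2 X^T M ( X w - y ) , $$

and hence, for any two weight vectors $a$ and $b$,

$$ \left\| \nabla_a \| X a - y \|_M^2 - \nabla_b \| X b - y \|_M^2 \right\| = \left\| 2 X^T M X ( a - b ) \right\| \leq 2 \, \| X^T M X \| \, \| a - b \| , $$

so $K = 2 \, \| X^T M X \|$ is a valid finite Lipschitz constant.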
$^1$ A function $f : A \to A$ is Lipschitz continuous if there exists a finite constant scalar $K$ such that $\| f(a) - f(b) \| \leq K \| a - b \|$ for any $a, b \in A$. The magnitude $K$ is a measure of the continuity of the function $f$.
 