Hence, when having a $w_N$ that minimises $\| X_N w_N - y_N \|_{M_N}$, both the sequence of inputs and the output estimates are $M_N$-orthogonal to the estimation errors. In other words, the hyperplane spanned by the vectors $X_N$ and $X_N w_N$ is $M_N$-orthogonal to the vector of estimation errors $( X_N w_N - y_N )$, and therefore, the output estimate is an orthogonal projection onto this hyperplane with respect to $\langle \cdot , \cdot \rangle_{M_N}$. This conforms to the batch learning solution introduced in Sect. 5.2.1.
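This orthogonality is easy to check numerically. The following is a minimal sketch, assuming numpy, randomly generated data, and a diagonal matching matrix; all variable names are illustrative and not part of the text.

```python
import numpy as np

# Numeric check of the M_N-orthogonality property for a weighted
# least-squares fit (illustrative data, not from the text).
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))              # input matrix X_N
y = rng.normal(size=N)                   # output vector y_N
M = np.diag(rng.uniform(0.1, 1.0, N))    # diagonal matching matrix M_N

# w_N minimising ||X_N w - y_N||^2_{M_N}: solve X^T M X w = X^T M y
w = np.linalg.solve(X.T @ M @ X, X.T @ M @ y)
residual = X @ w - y                     # estimation errors X_N w_N - y_N

# Inputs and output estimates are M_N-orthogonal to the residual
print(np.allclose(X.T @ M @ residual, 0.0))      # <inputs, errors>_M = 0
print(np.isclose((X @ w) @ M @ residual, 0.0))   # <estimates, errors>_M = 0
```

Both checks succeed because the normal equations $X^T M ( X w_N - y ) = 0$ hold exactly at the minimum.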
5.3.2 Steepest Gradient Descent
Steepest gradient descent is a well-known method for function minimisation,
based on following the gradient of that function. Applied to (5.5), it can be used
to find the weight vector that minimises the squared error. However, it is only
applicable if all observations are known at once, which is not the case when
performing incremental learning. Nonetheless, it is discussed here as it gives
valuable insights into the stability and speed of convergence of other gradient-
based incremental learning methods that are described in later sections.
As for batch learning, let $X$, $y$, $M$ and $c$ be the input matrix, the output vector, the matching matrix, and the match count, respectively, given all $N$ observations.
Then, steepest gradient descent is defined by

$$ w_{n+1} = w_n - \gamma_{n+1} \frac{1}{2} \nabla_{w_n} \| X w_n - y \|_M^2 , \qquad (5.18) $$
starting at some arbitrary $w_0$, and hence generating a sequence of weight vectors $\{ w_0, w_1, \dots \}$ by performing small steps along the gradient of the squared error. Note that $n$ does in this case refer to the iteration number of the method rather than to the index of the observation, and $\gamma_n > 0$ is the step size in the $n$th iteration. Evaluating the gradient $\nabla_{w_n} \| X w_n - y \|_M^2$ with respect to $w_n$ results in the algorithm
$$ w_{n+1} = w_n - \gamma_{n+1} X^T M ( X w_n - y ) . \qquad (5.19) $$
With each step along the gradient, steepest gradient descent reduces the squared error. As the error function is convex and hence has a unique minimum, following its gradient leads to this minimum and thus solves (5.5).
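For concreteness, (5.19) translates directly into a few lines of code. The following is a minimal sketch, assuming numpy, a diagonal matching matrix $M = \mathrm{diag}(m)$, and a constant step size kept small enough for the iteration to remain stable; function and variable names are illustrative only.

```python
import numpy as np

def steepest_gradient_descent(X, y, m, n_iter=500):
    """Minimal sketch of the iteration in Eq. (5.19).

    X: (N, D) input matrix, y: (N,) output vector, m: (N,) matching
    values forming the diagonal of the matching matrix M.
    """
    M = np.diag(m)
    H = X.T @ M @ X
    # Constant step size, kept below 2 / lambda_max(H) so that the
    # iteration remains stable (lambda_max = 2-norm of the symmetric H).
    gamma = 1.0 / np.linalg.norm(H, 2)
    w = np.zeros(X.shape[1])                    # arbitrary w_0
    for _ in range(n_iter):
        w = w - gamma * X.T @ M @ (X @ w - y)   # Eq. (5.19)
    return w
```

As the number of iterations grows, the returned weight vector approaches the batch solution of (5.5) for the same $X$, $y$ and $M$.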
Stability Criteria
By definition, the step size $\gamma_n$ can change at each iteration. When it is kept constant, that is, $\gamma_n = \gamma$ for all $n > 0$, and the gradient is Lipschitz continuous$^1$, the steepest gradient descent method is guaranteed to converge to the minimum of (5.5), if that minimum exists [17, Prop. 3.4]. In our case the gradient, as a function of $w$, is Lipschitz continuous and hence convergence for a constant step size is guaranteed.
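To see why the gradient is Lipschitz continuous here, note that it is linear in $w$, so a finite constant can be read off directly:

$$ \nabla_w \| X w - y \|_M^2 = 2 X^T M ( X w - y ) , $$

and hence, for any two weight vectors $a$ and $b$,

$$ \left\| \nabla_a \| X a - y \|_M^2 - \nabla_b \| X b - y \|_M^2 \right\| = \left\| 2 X^T M X ( a - b ) \right\| \leq 2 \, \| X^T M X \| \, \| a - b \| , $$

so $K = 2 \, \| X^T M X \|$ is a valid finite Lipschitz constant.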
$^1$ A function $f : A \to A$ is Lipschitz continuous if there exists a finite constant scalar $K$ such that $\| f(a) - f(b) \| \leq K \| a - b \|$ for any $a, b \in A$. The magnitude $K$ is a measure of the continuity of the function $f$.
 