where $w_N$ is the weight vector that satisfies (5.16) and thus $h_N(w_N)$ is the minimal MSE, and $\lambda_j$ is the $j$th of the $J$ eigenvalues of $c^{-1}\mathbf{X}_N^\top\mathbf{M}_N\mathbf{X}_N$. This shows that the excess MSE estimate is i) always positive, and ii) proportional to the step size $\gamma$. Thus, reducing the step size also reduces the misadjustment. Indeed, under the standard stochastic approximation assumptions that $\sum_{n=1}^{\infty} \gamma_n = \infty$ and $\sum_{n=1}^{\infty} \gamma_n^2 < \infty$, the Lipschitz continuity of the gradient, and some pseudogradient property of the gradient, convergence to the optimal estimate can be guaranteed [17, Prop. 4.1].
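For illustration (a standard example, not one prescribed by the text), the harmonic schedule $\gamma_n = 1/n$ satisfies both summability conditions:
\[
\sum_{n=1}^{\infty} \frac{1}{n} = \infty ,
\qquad
\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty .
\]
Informally, the steps remain large enough in total to reach the optimum from any starting point, while the accumulated squared step sizes, and with them the effect of gradient noise, stay bounded.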
Stability Criteria and Average Time Constant
As the LMS filter is a transversal filter of length one, using only the current observation for its gradient estimate, no concrete bounds for the step size can currently be given [105, Chap. 6]. However, if the step size is small compared
to the inverse of the largest eigenvalue of the input vector correlation matrix,
then the stability criteria are the same as for steepest gradient descent (5.20).
As the gradient changes with each step, we can only give an expression for the
local time constant that varies with time (for more details see [78]). On average,
however, the time constant can be bounded in the same way as for steepest
gradient descent (5.21), with the same consequences.
This leaves us with a dilemma: as established above, the misadjustment is proportional to the step size, whereas the time constant is inversely proportional to it. Hence, we have conflicting requirements and can aim either for a low estimation error or for a fast rate of convergence, but cannot satisfy both requirements with anything other than a compromise.
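To make this trade-off concrete, the following sketch (an illustration of my own, not code from the text; the toy regression problem, noise level, and step-size fractions are arbitrary choices) runs the LMS recursion with two fixed step sizes expressed as fractions of the inverse of the largest eigenvalue of the input correlation matrix. The larger step size adapts faster but settles at a larger excess error, while the smaller one converges more slowly to a more accurate estimate.

import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression problem (all values are arbitrary illustration choices).
D, N = 3, 5000
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ w_true + 0.3 * rng.normal(size=N)

# Largest eigenvalue of the empirical input correlation matrix; the step
# sizes below are chosen relative to its inverse (cf. (5.20)).
lam_max = np.linalg.eigvalsh(X.T @ X / N).max()

def lms(step_size):
    """LMS recursion w <- w + step_size * x_n * (y_n - x_n^T w)."""
    w = np.zeros(D)
    sq_err = []
    for x_n, y_n in zip(X, y):
        e = y_n - x_n @ w
        w += step_size * e * x_n
        sq_err.append(e * e)
    # Report the MSE early on (speed of convergence) and at the end (misadjustment).
    return np.mean(sq_err[:200]), np.mean(sq_err[-1000:])

for frac in (0.5, 0.05):
    early, late = lms(frac / lam_max)
    print(f"step size {frac}/lam_max: early MSE {early:.3f}, final MSE {late:.3f}")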
Relation to Batch Learning
To get a better intuitive understanding of how the LMS algorithm estimates the
weight vector, let us reformulate it as a batch learning approach for the simplified
case of an averaging classifier that matches all inputs, that is, $x_n = 1$ and $m(x_n) = 1$ for all $n > 0$. In that case, (5.25) reduces to
\[
w_{N+1} = w_N + \gamma_{N+1} \left( y_{N+1} - w_N \right) , \tag{5.27}
\]
and by recursive substitution (as in Example 3.2) results in the batch learning formulation
\[
w_N = \sum_{n=1}^{N} y_n \gamma_n \prod_{m=n+1}^{N} (1 - \gamma_m) + w_0 \prod_{n=1}^{N} (1 - \gamma_n) . \tag{5.28}
\]
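The equivalence of the recursion (5.27) and the batch form (5.28) is easy to verify numerically. The following sketch (my own check, with an arbitrarily chosen schedule $\gamma_n = 1/n$ and synthetic observations) compares the two, and also illustrates that this particular schedule recovers the sample average:

import numpy as np

rng = np.random.default_rng(1)
N = 50
y = rng.normal(loc=2.0, size=N)       # synthetic observations (arbitrary choice)
gamma = 1.0 / np.arange(1, N + 1)     # example schedule gamma_n = 1/n
w0 = 0.0                              # initial weight estimate

# Recursive form (5.27): w_n = w_{n-1} + gamma_n * (y_n - w_{n-1})
w_rec = w0
for n in range(N):
    w_rec += gamma[n] * (y[n] - w_rec)

# Batch form (5.28): weighted sum of observations plus the decayed initial bias
w_batch = w0 * np.prod(1.0 - gamma)
for n in range(N):
    w_batch += y[n] * gamma[n] * np.prod(1.0 - gamma[n + 1:])

print(np.isclose(w_rec, w_batch))     # True: both formulations agree
print(np.isclose(w_rec, y.mean()))    # True: gamma_n = 1/n yields the sample average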
Hence, the $n$th observation $y_n$ is weighted by $\gamma_n \prod_{m=n+1}^{N} (1 - \gamma_m)$, which, for $0 < \gamma_n < 1$ for all $0 < n \le N$, means that the lower $n$, the less $y_n$ contributes to the weight estimate. Also, $w_0$ introduces a bias that decays exponentially with $\prod_{n=1}^{N} (1 - \gamma_n)$. Comparing this insight to the results of Example 5.2, where it was shown that the optimal weight in the least squares sense for averaging classifiers
 