where $w_N$ is the weight vector that satisfies (5.16) and thus $h_N(w_N)$ is the minimal MSE, and $\lambda_j$ is the $j$th of the $J$ eigenvalues of $c^{-1}\mathbf{X}_N^\top\mathbf{M}_N\mathbf{X}_N$. This shows that the excess MSE estimate is i) always positive, and ii) proportional to the step size $\gamma$. Thus, reducing the step size also reduces the misadjustment. Indeed, under the standard stochastic approximation assumptions that $\sum_{n=1}^{\infty} \gamma_n = \infty$ and $\sum_{n=1}^{\infty} \gamma_n^2 < \infty$, the Lipschitz continuity of the gradient, and some pseudogradient property of the gradient, convergence to the optimal estimate can be guaranteed [17, Prop. 4.1].
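For illustration (a standard example, not one prescribed by the text), the harmonic schedule $\gamma_n = 1/n$ satisfies both summability conditions:
\[
\sum_{n=1}^{\infty} \frac{1}{n} = \infty ,
\qquad
\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty .
\]
Informally, the steps remain large enough in total to reach the optimum from any starting point, while the accumulated squared step sizes, and with them the effect of gradient noise, stay bounded.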
Stability Criteria and Average Time Constant
As the LMS filter is a transversal filter of length one, using only the current observation for its gradient estimate, no concrete bounds for the step size can currently be given [105, Chap. 6]. However, if the step size is small compared
to the inverse of the largest eigenvalue of the input vector correlation matrix,
then the stability criteria are the same as for steepest gradient descent (5.20).
As the gradient changes with each step, we can only give an expression for the
local time constant that varies with time (for more details see [78]). On average,
however, the time constant can be bounded in the same way as for steepest
gradient descent (5.21), with the same consequences.
This leaves us with a dilemma: as established above, the misadjustment is proportional to the step size, whereas the time constant is inversely proportional to it. Hence, we have conflicting requirements and can aim either for a low estimation error or for a fast rate of convergence, but cannot satisfy both requirements with anything other than a compromise.
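To make this trade-off concrete, the following sketch (an illustration of my own, not code from the text; the toy regression problem, noise level, and step-size fractions are arbitrary choices) runs the LMS recursion with two fixed step sizes expressed as fractions of the inverse of the largest eigenvalue of the input correlation matrix. The larger step size adapts faster but settles at a larger excess error, while the smaller one converges more slowly to a more accurate estimate.

import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression problem (all values are arbitrary illustration choices).
D, N = 3, 5000
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ w_true + 0.3 * rng.normal(size=N)

# Largest eigenvalue of the empirical input correlation matrix; the step
# sizes below are chosen relative to its inverse (cf. (5.20)).
lam_max = np.linalg.eigvalsh(X.T @ X / N).max()

def lms(step_size):
    """LMS recursion w <- w + step_size * x_n * (y_n - x_n^T w)."""
    w = np.zeros(D)
    sq_err = []
    for x_n, y_n in zip(X, y):
        e = y_n - x_n @ w
        w += step_size * e * x_n
        sq_err.append(e * e)
    # Report the MSE early on (speed of convergence) and at the end (misadjustment).
    return np.mean(sq_err[:200]), np.mean(sq_err[-1000:])

for frac in (0.5, 0.05):
    early, late = lms(frac / lam_max)
    print(f"step size {frac}/lam_max: early MSE {early:.3f}, final MSE {late:.3f}")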
Relation to Batch Learning
To get a better intuitive understanding of how the LMS algorithm estimates the
weight vector, let us reformulate it as a batch learning approach for the simplified
case of an averaging classifier that matches all inputs, that is, $x_n = 1$ and $m(x_n) = 1$ for all $n > 0$. In that case, (5.25) reduces to
\[
w_{N+1} = w_N + \gamma_{N+1} \left( y_{N+1} - w_N \right) , \tag{5.27}
\]
and by recursive substitution (as in Example 3.2) results in the batch learning formulation
\[
w_N = \sum_{n=1}^{N} y_n \gamma_n \prod_{m=n+1}^{N} (1 - \gamma_m) + w_0 \prod_{n=1}^{N} (1 - \gamma_n) . \tag{5.28}
\]
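The equivalence of the recursion (5.27) and the batch form (5.28) is easy to verify numerically. The following sketch (my own check, with an arbitrarily chosen schedule $\gamma_n = 1/n$ and synthetic observations) compares the two, and also illustrates that this particular schedule recovers the sample average:

import numpy as np

rng = np.random.default_rng(1)
N = 50
y = rng.normal(loc=2.0, size=N)       # synthetic observations (arbitrary choice)
gamma = 1.0 / np.arange(1, N + 1)     # example schedule gamma_n = 1/n
w0 = 0.0                              # initial weight estimate

# Recursive form (5.27): w_n = w_{n-1} + gamma_n * (y_n - w_{n-1})
w_rec = w0
for n in range(N):
    w_rec += gamma[n] * (y[n] - w_rec)

# Batch form (5.28): weighted sum of observations plus the decayed initial bias
w_batch = w0 * np.prod(1.0 - gamma)
for n in range(N):
    w_batch += y[n] * gamma[n] * np.prod(1.0 - gamma[n + 1:])

print(np.isclose(w_rec, w_batch))     # True: both formulations agree
print(np.isclose(w_rec, y.mean()))    # True: gamma_n = 1/n yields the sample average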
Hence, the $n$th observation $y_n$ is weighted by $\gamma_n \prod_{m=n+1}^{N} (1 - \gamma_m)$, which, for $0 < \gamma_n < 1$ for all $0 < n \le N$, means that the lower $n$, the less $y_n$ contributes to the weight estimate. Also, $w_0$ introduces a bias that decays exponentially with $\prod_{n=1}^{N} (1 - \gamma_n)$. Comparing this insight to the results of Example 5.2, where it was shown that the optimal weight in the least squares sense for averaging classifiers
 