$J(a) = \frac{1}{2}\,E[(X - a)^2]$ is minimized. As the probability distribution is unknown, it is not possible to compute its mathematical expectation. The gradient of $J$ (i.e., its first derivative) is $\nabla J(a) = -E(X - a)$. A gradient descent algorithm is
$$a_{k+1} = a_k - \gamma_{k+1}\,\nabla J(a_k),$$
where $\gamma_{k+1}$ is a positive scalar.
Replacing the gradient $\nabla J(a_k)$ by the expression $-(X_{k+1} - a_k)$ yields the recursive estimate of the empirical mean,
$$a_{k+1} = a_k + \gamma_{k+1}\,(X_{k+1} - a_k).$$
Note that the expectation of the random variable $-(X_{k+1} - a_k)$ is $-(E(X) - a_k)$, which is exactly the value $\nabla J(a_k)$ of the gradient of the cost function $J$ taken at the current value $a_k$ of the parameter estimate. Therefore, this algorithm
is termed stochastic gradient. In the gradient descent, the gradient of the cost
function has been replaced by a random variable, whose expectation is equal
to this gradient. The stochastic gradient is known at any time, whereas the
“total” gradient depends on the law of X , which is unknown. The stochastic
gradient algorithm has already been mentioned in Chap. 2. We will study it
in more detail in the following.
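As an illustration, the following sketch (in Python) runs this recursion on synthetic Gaussian data; the step size $\gamma_{k+1} = 1/(k+1)$ and the data are illustrative assumptions, not prescribed by the text. With that particular step size, $a_k$ is exactly the running average of the first $k$ observations.

```python
import random

def recursive_mean(samples):
    """Stochastic-gradient estimate of E(X):
    a_{k+1} = a_k + gamma_{k+1} * (x_{k+1} - a_k).

    gamma_{k+1} = 1/(k+1) is an illustrative choice; with it, a_k is the
    running average of the first k samples.
    """
    a = 0.0
    for k, x in enumerate(samples):
        gamma = 1.0 / (k + 1)   # positive scalar step size
        a += gamma * (x - a)    # step opposite to the stochastic gradient -(x - a)
        yield a

if __name__ == "__main__":
    random.seed(0)
    xs = [random.gauss(2.0, 1.0) for _ in range(10_000)]
    estimates = list(recursive_mean(xs))
    print(estimates[-1])        # close to the true expectation 2.0
```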
That algorithm performs the optimization task without prior estimation
of the unknown probability distribution. Actually, the optimization and the
estimation tasks are performed simultaneously. By contrast, in the classical
estimation process, the estimation phase is performed first. In that phase, the
criterion is estimated by the associated empirical criterion; here, the estimate of $J(a) = \frac{1}{2}\,E[(X - a)^2]$ would be $J_N(a) = \frac{1}{2N}\sum_{k=1}^{N}(x_k - a)^2$; then,
the optimization phase is performed using that estimate. Actually, in this
example, the two approaches lead to the same result because the model is
linear with respect to the parameter of interest, namely a . Nevertheless, the
implementations of the two algorithms are different: the implementation of
the stochastic gradient is recursive.
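To make the comparison concrete, the following sketch carries out both routes on the same synthetic data: the classical route forms the empirical criterion $J_N(a)$ and minimizes it in closed form (its minimizer is the sample mean), whereas the recursive route applies the stochastic gradient update sample by sample. The data and the step size $1/(k+1)$ are again illustrative assumptions.

```python
import random

def batch_minimizer(samples):
    """Classical two-phase approach: form the empirical criterion
    J_N(a) = 1/(2N) * sum_k (x_k - a)^2 and minimize it.

    Setting dJ_N/da = -(1/N) * sum_k (x_k - a) = 0 gives the sample mean,
    so no iterative optimizer is needed for this linear-in-a model.
    """
    return sum(samples) / len(samples)

def recursive_minimizer(samples):
    """Stochastic-gradient (recursive) approach with the illustrative step 1/(k+1)."""
    a = 0.0
    for k, x in enumerate(samples):
        a += (x - a) / (k + 1)
    return a

if __name__ == "__main__":
    random.seed(1)
    xs = [random.gauss(-0.5, 2.0) for _ in range(1_000)]
    print(batch_minimizer(xs), recursive_minimizer(xs))  # identical up to rounding
```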
4.3.2 Recursive Estimation of Linear Regression
The stochastic gradient approach that we have just used to estimate the em-
pirical mean is frequently used for linear and nonlinear regression. In the con-
text of linear regression, the algorithm is named recursive least mean squares
(LMS) or Widrow-Hoff algorithm. This algorithm is well known in signal the-
ory, where it is used to compute a linear regression in an adaptive way (see
also Chap. 2).
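A minimal sketch of the LMS (Widrow-Hoff) recursion is given below, ahead of the formal setting; the constant step size mu, the synthetic data, and the omission of a bias term are simplifying assumptions made for illustration only.

```python
import random

def lms(stream, n, mu=0.01):
    """Least-mean-squares (Widrow-Hoff) recursion, sketched for y ~ x . w:
    w_{k+1} = w_k + mu * (y_{k+1} - x_{k+1} . w_k) * x_{k+1}.

    The constant step mu and the absence of a bias term are simplifying
    assumptions for illustration.
    """
    w = [0.0] * n
    for x, y in stream:
        err = y - sum(xi * wi for xi, wi in zip(x, w))    # instantaneous prediction error
        w = [wi + mu * err * xi for wi, xi in zip(w, x)]  # stochastic-gradient step
    return w

if __name__ == "__main__":
    random.seed(2)
    true_w = [1.5, -3.0]
    def stream(n_samples):
        for _ in range(n_samples):
            x = [random.gauss(0, 1) for _ in true_w]
            y = sum(xi * wi for xi, wi in zip(x, true_w)) + random.gauss(0, 0.1)
            yield x, y
    print(lms(stream(20_000), n=2, mu=0.01))  # should approach [1.5, -3.0]
```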
Consider the problem of the minimization of $J(w) = \frac{1}{2}\,E[(Y - Xw - b)^2]$ with respect to $w$, where $X$ is a second-order random vector of dimension $(1, n)$ (i.e., its expectation and covariance matrices are well defined). The vector parameter