2.5.3 Adaptive (On-Line) Training of Models that Are Nonlinear
with Respect to Their Parameters
In the previous sections, we discussed methods that optimize the least squares
cost function by using all the training data available at the beginning of
training: the gradient of the total cost can be computed as the sum of the
gradients of the partial costs.
In adaptive (on-line) training, the parameters are updated by using the gradient of the partial cost for each example, so that training can start even before all the training data are available. Such a procedure is often useful for updating a model after an initial nonadaptive training. Those methods are discussed in detail in Chap. 4.
A variant of adaptive training algorithms consists of updating the parameters after receiving a block of data (“block training”): the partial cost is then related not to a single example, but to a block of examples.
The most popular adaptive training technique is called stochastic gradient, whereby the parameter updates are proportional to the gradient of the partial cost,

w_{k+1} = w_k - \mu_k \nabla J^k(w_k),

where w_k is the value of the vector of parameters after iteration k (i.e., after updating the parameters from example k) and \mu_k is the training rate at iteration k. Note that the LMS algorithm, discussed in the framework of the training of linear models, is a particular case of stochastic gradient.
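As an illustration, here is a minimal sketch of the stochastic gradient update in Python, using an assumed toy model f(x; w) = w0 tanh(w1 x + w2) that is nonlinear in its parameters; the function names, the constant training rate mu, and the block_size parameter (which, when greater than 1, gives the block-training variant described above) are illustrative choices, not part of the original text.

```python
import numpy as np

def model(w, x):
    # Toy model, nonlinear in its parameters: f(x; w) = w0 * tanh(w1*x + w2)
    return w[0] * np.tanh(w[1] * x + w[2])

def partial_gradient(w, x, y):
    # Gradient of the partial cost J^k(w) = 0.5 * (y - f(x; w))**2
    t = np.tanh(w[1] * x + w[2])
    sech2 = 1.0 - t ** 2                   # derivative of tanh
    err = model(w, x) - y
    return err * np.array([t, w[0] * sech2 * x, w[0] * sech2])

def stochastic_gradient(w, stream, mu=0.05, block_size=1):
    # w_{k+1} = w_k - mu * grad J^k(w_k), applied after each example
    # (block_size=1, pure stochastic gradient) or after each block of
    # examples (block training); a decreasing mu_k could replace mu.
    block = []
    for x, y in stream:
        block.append((x, y))
        if len(block) == block_size:
            g = sum(partial_gradient(w, xi, yi) for xi, yi in block)
            w = w - mu * g
            block = []
    return w

# Usage: examples are processed as they arrive, so training can start
# before the whole training set is available.
rng = np.random.default_rng(0)
xs = rng.uniform(-2.0, 2.0, size=500)
ys = 1.5 * np.tanh(0.8 * xs - 0.3) + 0.05 * rng.standard_normal(500)
w = stochastic_gradient(np.array([0.5, 0.5, 0.0]), zip(xs, ys))
```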
Some empirical results suggest that the stochastic gradient method avoids local minima more efficiently than simple gradient descent in batch learning.
An alternative technique, stemming from adaptive filtering, can be used for neural network training: the extended Kalman filter [Puskorius et al. 1994]. It is more efficient than stochastic gradient in terms of convergence speed, but the number of operations per iteration is higher. That approach is described in detail in Chap. 4.
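A minimal sketch of the idea follows, for a scalar-output model (the same assumed toy model as above): the parameter vector plays the role of the state, with trivial dynamics, the model output is the measurement, and each incoming example triggers one Kalman update. The noise variances R and Q are illustrative values; [Puskorius et al. 1994] treat the full recurrent-network case, which requires more elaborate bookkeeping.

```python
import numpy as np

def model(w, x):
    # Same toy model as above: f(x; w) = w0 * tanh(w1*x + w2)
    return w[0] * np.tanh(w[1] * x + w[2])

def jacobian(w, x):
    # Row Jacobian H_k = d f(x; w) / d w of the model output
    t = np.tanh(w[1] * x + w[2])
    sech2 = 1.0 - t ** 2
    return np.array([t, w[0] * sech2 * x, w[0] * sech2])

def ekf_step(w, P, x, y, R=0.01, Q=1e-6):
    # One extended-Kalman-filter update: the parameters are the state
    # (with trivial dynamics w_{k+1} = w_k), the output is the measurement.
    H = jacobian(w, x)                    # shape (n,)
    S = H @ P @ H + R                     # innovation variance (scalar)
    K = P @ H / S                         # Kalman gain, shape (n,)
    w = w + K * (y - model(w, x))         # parameter update
    P = P - np.outer(K, H @ P) + Q * np.eye(len(w))   # covariance update
    return w, P

# Usage: one update per incoming example, as in adaptive training.
rng = np.random.default_rng(0)
xs = rng.uniform(-2.0, 2.0, size=200)
ys = 1.5 * np.tanh(0.8 * xs - 0.3) + 0.05 * rng.standard_normal(200)
w, P = np.array([0.5, 0.5, 0.0]), np.eye(3)
for x, y in zip(xs, ys):
    w, P = ekf_step(w, P, x, y)
```

Each update manipulates the n × n covariance matrix P, so the cost per iteration grows with n², whereas stochastic gradient only requires O(n) operations; this is the price paid for the faster convergence.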
2.5.4 Training with Regularization
As stated in Chap. 1, the objective of black-box modeling is the design of a
model that is complex enough to learn the training data, but does not exhibit
overfitting, i.e., does not adjust to noise. Two categories of strategies can be
used.
Passive techniques: several models, of different complexities, are trained as indicated in the previous section, and a selection among those models is performed after training, in order to discard models that exhibit overfitting; that is done by cross-validation or statistical tests, as explained in the next section (a sketch of this selection procedure follows below).
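As a sketch of the passive strategy, under illustrative assumptions (polynomial models of increasing degree stand in for models of different complexities, such as networks with increasing numbers of hidden neurons, and the data are synthetic):

```python
import numpy as np

def cross_validation_score(x, y, degree, n_folds=5):
    # Mean validation MSE over n_folds folds for a polynomial model of
    # the given degree (a stand-in for a model of a given complexity).
    idx = np.arange(len(x))
    scores = []
    for val in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, val)
        coeffs = np.polyfit(x[train], y[train], degree)   # training
        resid = y[val] - np.polyval(coeffs, x[val])       # validation
        scores.append(np.mean(resid ** 2))
    return np.mean(scores)

# Train candidate models of increasing complexity, then keep the one
# with the smallest cross-validation error: overly complex models that
# adjust to the noise are discarded.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1.0, 1.0, size=40))
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(40)
best_degree = min(range(1, 10), key=lambda d: cross_validation_score(x, y, d))
```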