2.10.6 Line Search Methods for the Training Rate
At iteration $i$ of an optimization method, an update direction must be computed; in BFGS, for instance, the direction $d_i = -M_i \nabla J(w(i-1))$ is computed by evaluating the gradient of the cost function by backpropagation, and by computing matrix $M_i$ as indicated in the previous section; in simple gradient descent, the update direction is $d_i = -\nabla J(w(i-1))$. The magnitude of the parameter update along that direction depends on the value of $\mu_i$: one seeks the value of $\mu_i$ for which the cost function will be minimum after updating the parameter vector along that direction, i.e., for which $J(w)$ is minimum when $w = w(i-1) + \mu_i d_i$. Insofar as $\mu_i$ is the only unknown, the problem is a line search problem. That search must be performed at each iteration of the training procedure: therefore, it must be fast; since the value of $\mu_i$ is not critical for second-order methods, a simple technique may be used. Nash's method produces satisfactory results; it seeks a step such that the updated value of the cost function is smaller than a given bound.
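As an illustration of the setup (not from the original text), the following minimal sketch shows how the update direction and the one-dimensional objective $\varphi(\mu) = J(w(i-1) + \mu d_i)$ minimized by the line search might be formed. The callables cost and grad and the matrix M are placeholders for the cost function, its backpropagated gradient, and the BFGS matrix of the previous section; inputs are assumed to be numpy-compatible arrays.

```python
def line_search_objective(cost, grad, w_prev, M=None):
    """Form the update direction d and the 1-D objective phi(mu) = J(w_prev + mu*d).

    cost : callable J(w)            -- cost function (placeholder)
    grad : callable grad J(w)       -- gradient, e.g. obtained by backpropagation
    M    : approximate inverse Hessian (BFGS); None means simple gradient descent
    """
    g = grad(w_prev)
    d = -(M @ g) if M is not None else -g   # d_i = -M_i grad J  or  d_i = -grad J

    def phi(mu):
        # Cost after a tentative update of magnitude mu along direction d
        return cost(w_prev + mu * d)

    return d, phi
```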
The technique seeks a step that complies with the descent condition
$$J\big(w(i-1) + \mu_i d_i\big) \;\le\; J\big(w(i-1)\big) + m_1\,\mu_i\, d_i^{T}\,\nabla J\big(w(i-1)\big),$$
where $m_1$ is a factor much smaller than 1 (typically $m_1 = 10^{-3}$). The search proceeds iteratively as follows: $\mu_i$ is given an arbitrary positive value, and the above condition is checked. If it is obeyed, the parameter update is accepted. Otherwise, the step is multiplied by a quantity smaller than 1 (typically 0.2), and the condition is tested again. The procedure is iterated until satisfaction. If the step value becomes too small (e.g., on the order of $10^{-16}$) without the condition being obeyed, or if the number of such search iterations exceeds a limit, the search is abandoned and the procedure is terminated.
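A minimal sketch of this backtracking search follows, assuming the direction d and the callables cost and grad are available as above. The constants $m_1 = 10^{-3}$, the reduction factor 0.2, and the minimum step $10^{-16}$ come from the text; the initial step and the iteration limit are arbitrary illustrative choices.

```python
def backtracking_step(cost, grad, w_prev, d, mu_init=1.0,
                      m1=1e-3, shrink=0.2, mu_min=1e-16, max_iter=50):
    """Search a step mu satisfying the descent condition
       J(w_prev + mu*d) <= J(w_prev) + m1 * mu * d^T grad J(w_prev).
       Returns the accepted mu, or None if the search is abandoned."""
    J_prev = cost(w_prev)
    slope = d @ grad(w_prev)      # d_i^T grad J(w(i-1)), negative for a descent direction
    mu = mu_init                  # arbitrary positive initial step
    for _ in range(max_iter):
        if cost(w_prev + mu * d) <= J_prev + m1 * mu * slope:
            return mu             # condition obeyed: accept the parameter update
        mu *= shrink              # otherwise shrink the step (typically by 0.2)
        if mu < mu_min:           # step has become too small: abandon the search
            return None
    return None
```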
An even simpler strategy, often used in conjunction with the Levenberg-Marquardt technique [Bishop 1995], is the following. Denote by $r > 1$ (generally equal to 10) a scale factor. At the beginning of the algorithm, $\mu_0$ is initialized to a large value ([Bishop 1995] suggests 0.1). At iteration $i$ of the algorithm, proceed as follows (a code sketch is given after the list):
1. Compute $J(w(i))$ with the value of $\mu_i$ obtained at the previous step.
2. If $J(w(i)) < J(w(i-1))$, then accept the update and divide $\mu_i$ by $r$.
3. Otherwise, retrieve $w(i-1)$ and multiply $\mu_i$ by $r$; iterate until a value of $\mu_i$ producing a decrease of $J$ is found.
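The sketch below illustrates one iteration of this step control. It assumes, for illustration, a Levenberg-Marquardt update of the form $w = w(i-1) - [H + \mu_i I]^{-1}\nabla J(w(i-1))$ with an approximate Hessian $H$; the function and argument names are not from the original text.

```python
import numpy as np

def lm_step_control(cost, grad, hess_approx, w_prev, mu, r=10.0, max_tries=20):
    """One iteration of the step control described above (illustrative sketch).

    Each trial value of mu requires solving a new linear system in H + mu*I,
    i.e., one matrix inversion per trial."""
    g = grad(w_prev)
    H = hess_approx(w_prev)          # approximate Hessian (e.g. Gauss-Newton)
    J_prev = cost(w_prev)
    n = w_prev.shape[0]
    for _ in range(max_tries):
        # Tentative Levenberg-Marquardt update with the current damping mu
        w_new = w_prev - np.linalg.solve(H + mu * np.eye(n), g)
        if cost(w_new) < J_prev:
            return w_new, mu / r     # decrease of J: accept the update, divide mu by r
        mu *= r                      # no decrease: keep w(i-1), multiply mu by r, try again
    return w_prev, mu                # give up if no decrease is found within the limit
```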
The above procedure requires a small number of matrix inversions at each iteration of the algorithm. However, the choice of the initial step has an influence on the rate of convergence of the algorithm. That drawback can be circumvented by a method that requires a larger number of matrix inversions:
1. Initialize $\mu_0$ to an arbitrary value.
2. Compute $J(w(i))$ with the value of $\mu_i$ found at the previous step.