2.10.6 Line Search Methods for the Training Rate
At iteration $i$ of an optimization method, an update direction must be computed; in BFGS for instance, the direction $d_i = -M_i \nabla J(w(i-1))$ is computed by evaluating the gradient of the cost function by backpropagation, and by computing matrix $M_i$ as indicated in the previous section; in simple gradient, the update direction is $d_i = -\nabla J(w(i-1))$.
The magnitude of the parameter update along that direction depends on the value of $\mu_i$: one seeks the value of $\mu_i$ for which the cost function will be minimum after updating the parameter vector along that direction, i.e., the value of $\mu_i$ for which $J(w)$ is minimum when $w = w(i-1) + \mu_i d_i$. Insofar as $\mu_i$ is the only unknown, the problem is a line search problem. That search must be performed at each iteration of the training procedure: therefore, it must be fast; since the value of $\mu_i$ is not critical for second-order methods, a simple technique may be used. Nash's method produces satisfactory results; it seeks a step such that the updated value of the cost function is smaller than a given bound.
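As a concrete illustration only (the function name update_direction and its arguments are assumptions, not the book's notation), the two update directions and the resulting one-dimensional problem in $\mu_i$ can be sketched as follows:

```python
import numpy as np

# Hypothetical sketch (not the book's code): forming the update direction d_i
# from the gradient of the cost J at w(i-1), obtained by backpropagation.
def update_direction(grad, M=None):
    """Return d_i = -M @ grad (BFGS, with M the approximate inverse Hessian)
    or d_i = -grad (simple gradient) when no matrix M is supplied."""
    if M is None:
        return -grad      # simple gradient: steepest-descent direction
    return -M @ grad      # BFGS: quasi-Newton direction

# The line search then minimizes J(w_prev + mu * d) over the single unknown mu.
```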
Nash's technique seeks a step that complies with the descent condition
$$J(w(i-1) + \mu_i d_i) \le J(w(i-1)) + m_1 \mu_i d_i^T \nabla J(w(i-1)),$$
where $m_1$ is a factor much smaller than 1 (typically $m_1 = 10^{-3}$). The search proceeds iteratively as follows: $\mu_i$ is given an arbitrary positive value, and the above descent condition is checked. If it is obeyed, the parameter update is accepted. Otherwise, the step is multiplied by a quantity smaller than 1 (typically 0.2), and the condition is tested again. The procedure is iterated until the condition is satisfied. If the step becomes too small, e.g., on the order of $10^{-16}$, without the condition being obeyed, or if the number of such search iterations exceeds a limit, the search is abandoned and the procedure is terminated.
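A minimal sketch of this backtracking procedure (names such as nash_line_search, J, and shrink are illustrative assumptions, not code from the book; w, d, and grad are assumed to be NumPy arrays):

```python
def nash_line_search(J, w, d, grad, mu_init=1.0, m1=1e-3,
                     shrink=0.2, mu_min=1e-16, max_iter=50):
    """Backtracking search for a step mu obeying the descent condition
    J(w + mu*d) <= J(w) + m1 * mu * (d . grad); returns mu, or None if
    the search is abandoned because the step became too small or the
    iteration budget was exhausted."""
    J0 = J(w)
    slope = float(d @ grad)          # negative when d is a descent direction
    mu = mu_init                     # arbitrary positive initial step
    for _ in range(max_iter):
        if J(w + mu * d) <= J0 + m1 * mu * slope:
            return mu                # condition obeyed: accept the update
        mu *= shrink                 # otherwise shrink the step (e.g. by 0.2)
        if mu < mu_min:
            break                    # step on the order of 1e-16: give up
    return None
```

The defaults m1 = 1e-3 and shrink = 0.2 mirror the typical values quoted in the text.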
An even simpler strategy, often used in conjunction with the Levenberg-Marquardt technique [Bishop 1995], is the following: we denote by $r > 1$ (generally equal to 10) a scale factor. At the beginning of the algorithm, $\mu_0$ is initialized to a large value ([Bishop 1995] suggests 0.1). At iteration $i$ of the algorithm:
1. Compute $J(w(i))$ with $\mu_i$ computed at the previous step.
2. If $J(w(i)) < J(w(i-1))$, then accept the update and divide $\mu_i$ by $r$.
3. Otherwise, retrieve $w(i-1)$ and multiply $\mu_i$ by $r$. Iterate until a value of $\mu_i$ producing a decrease of $J$ is found.
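A hedged sketch of that step-control loop (the helper lm_update and the other names are placeholders for the Levenberg-Marquardt update and the cost evaluation, not the book's code; the damping update follows the usual convention of dividing by r on success and multiplying on failure, as in [Bishop 1995]):

```python
def lm_step_control(J, w_prev, lm_update, mu, r=10.0, max_tries=30):
    """Sketch of the simple step-control loop above. lm_update(w_prev, mu) is
    assumed to return the candidate weights obtained with damping term mu
    (i.e. by inverting [H + mu*I]); J evaluates the cost function."""
    J_prev = J(w_prev)
    for _ in range(max_tries):
        w_new = lm_update(w_prev, mu)
        if J(w_new) < J_prev:
            return w_new, mu / r     # decrease: accept and relax the damping
        mu *= r                      # no decrease: keep w(i-1), increase mu
    return w_prev, mu                # no improving mu found within the budget
```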
The above procedure requires a small number of matrix inversions at each iteration of the algorithm. However, the choice of the initial step has an influence on the rate of convergence of the algorithm. That drawback can be circumvented by a method that requires a larger number of matrix inversions:
1. Initialize $\mu_0$ to an arbitrary value.
2. Compute $J(w(i))$ with $\mu_i$ found at the previous step.