The weight is modified only if successive gradients do not change the direction: $\nabla_{ij} E(t-1) \cdot \nabla_{ij} E(t) \geq 0$. The step size is modified according to:
$$
\Delta_{ij}(t) =
\begin{cases}
\eta^{+} \cdot \Delta_{ij}(t-1) & \text{if } \nabla_{ij} E(t-1) \cdot \nabla_{ij} E(t) > 0, \\
\eta^{-} \cdot \Delta_{ij}(t-1) & \text{if } \nabla_{ij} E(t-1) \cdot \nabla_{ij} E(t) < 0, \\
\Delta_{ij}(t-1) & \text{else.}
\end{cases}
\tag{6.10}
$$
The step size is increased by a factor $\eta^{+}$ if successive updates go in the same direction and decreased by multiplying it with $\eta^{-}$ if the gradient direction changes. In the latter case, $\nabla_{ij} E(t)$ is set to zero to avoid a step size update in the next iteration of the algorithm. The factors comply with $0 < \eta^{-} < 1 < \eta^{+}$. Recommended values are $\eta^{-} = 0.5$ for fast deceleration and $\eta^{+} = 1.2$ for cautious acceleration of learning. The step sizes are initialized uniformly to $\Delta_{0}$. It is ensured that they do not leave the interval $[\Delta_{\min}, \Delta_{\max}]$. The RPROP algorithm has been shown to be robust to the choice of its parameters [107]. It is easy to implement and requires only the storage of two additional quantities per weight: the last gradient and the step size.
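To make the update rule concrete, the following is a minimal NumPy sketch of one RPROP iteration over a weight array. The factors $\eta^{+} = 1.2$ and $\eta^{-} = 0.5$ follow the recommendations above; the numeric bounds `step_min` and `step_max` and the initial step size are assumptions, since the text does not fix $\Delta_{\min}$, $\Delta_{\max}$, or $\Delta_{0}$ numerically.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One RPROP iteration for a weight array w.

    grad and prev_grad are the current and previous gradients of E
    with respect to w; step holds the per-weight step sizes.
    step_min and step_max are assumed bounds for the interval
    [Delta_min, Delta_max].
    """
    change = grad * prev_grad    # sign of the product of successive gradients
    grew = change > 0            # same direction: accelerate cautiously
    shrank = change < 0          # direction flipped: decelerate fast
    step = np.where(grew, np.minimum(step * eta_plus, step_max), step)
    step = np.where(shrank, np.maximum(step * eta_minus, step_min), step)
    # After a sign change the gradient is set to zero, so the weight is
    # not moved and the next iteration takes the "else" branch of (6.10).
    grad = np.where(shrank, 0.0, grad)
    w = w - np.sign(grad) * step # move each weight against its gradient sign
    return w, grad, step         # the returned grad becomes prev_grad
```

The step sizes would be initialized uniformly, e.g. `step = np.full_like(w, 0.1)` for an assumed $\Delta_{0} = 0.1$.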
Mini Batches. RPROP as well as other advanced optimization techniques are batch
methods because they need an accurate estimate of the gradient. For real-world
tasks, the training set is frequently large and contains many similar examples. In
this case, it is very expensive to consider all training examples before updating the
weights.
One idea to overcome this difficulty is to only use subsets of the training set, so-
called mini batches, to estimate the gradient. This is a compromise between batch
training and online learning. The easiest way to implement mini batches is to update the weights after every $n$ examples. Another possibility is to work with randomly chosen subsets of
the training set. Møller [162] investigated the effect of training with mini batches.
He proposed to start with a small set and to enlarge it as the training proceeds.
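The two simple mini-batch schemes mentioned above, fixed slices of $n$ examples and randomly chosen subsets, might be sketched as follows; the function names and parameters are hypothetical.

```python
import numpy as np

def sequential_batches(num_examples, n):
    """Yield index ranges so that the weights are updated every n examples."""
    for start in range(0, num_examples, n):
        yield np.arange(start, min(start + n, num_examples))

def random_batch(num_examples, n, rng):
    """Estimate the gradient from a randomly chosen subset of size n."""
    return rng.choice(num_examples, size=n, replace=False)
```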
For the RPROP algorithm, the gradient estimate must not only be accurate but
stable as well. Because the signs of successive gradient evaluations determine the
adaptation of the learning rates, fluctuations of the gradient estimate may slow down
the convergence.
For this reason, I used RPROP with slowly changing random subsets of the train-
ing set. A small working set of examples is initialized randomly. In each iteration
only a small fraction of the set is replaced by new randomly chosen examples. As
training proceeds, the network error for most of the examples will be low. Hence, the
size of the working set must be increased to include enough informative examples.
The last few iterations of the learning algorithm are done with the entire training set.
Such an approach can speed up the training significantly since most iterations of
the learning algorithm can be done with only a small fraction of the training set. In
addition, the ability of the network to learn the task can be judged quickly because
during the first iterations, the working set is still small.
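A slowly changing working set could look like the sketch below. The replacement fraction and the growth amount are assumptions; the text only states that a small fraction of the set is replaced per iteration, that the set grows as training proceeds, and that the last few iterations use the entire training set.

```python
import numpy as np

def update_working_set(working, num_examples, rng,
                       replace_frac=0.05, grow=0):
    """Replace a small fraction of the working set with fresh random
    examples and optionally grow it by `grow` examples.

    working is an index array into the training set; replace_frac and
    grow are assumed parameters, not values given in the text.
    """
    k = max(1, int(replace_frac * len(working)))
    # keep most of the current working set ...
    keep = rng.choice(working, size=len(working) - k, replace=False)
    # ... and draw replacements (plus any growth) from the rest
    pool = np.setdiff1d(np.arange(num_examples), keep)
    fresh = rng.choice(pool, size=k + grow, replace=False)
    return np.concatenate([keep, fresh])

# Usage sketch:
# rng = np.random.default_rng(0)
# working = rng.choice(num_examples, size=initial_size, replace=False)
# per iteration: working = update_working_set(working, num_examples, rng,
#                                             grow=growth_for_this_iteration)
# final iterations: train on np.arange(num_examples), i.e. the full set
```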