6.2.3 Regularization
As already discussed in Section 6.1, the capacity of a learning machine must be
appropriate for the task to ensure generalization. One way to restrict the capacity of
a neural network is to use few adaptable parameters. This can be done by keeping
the network small or by sharing weights.
Another way is to lower the capacity of a high-capacity machine by regular-
ization. Regularization constrains the parameters such that only smooth approxima-
tions to the training set are possible.
It was already mentioned that early stopping has a regularizing effect. The reason
is that the weights of a neural network are still relatively small when training is
stopped early. This limits the nonlinearity of the network, since the transfer
functions are almost linear for small inputs. Limited nonlinearity yields decision
functions that smoothly approximate the training set.
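A minimal sketch of the stopping mechanism is given below in Python. It is only an illustration, not taken from the text: it uses hypothetical synthetic data and a plain linear model rather than a network, so it shows only the validation-based stopping criterion; the learning rate, patience, and data are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): noisy linear targets, split into training and validation sets.
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)
X_train, y_train = X[:100], y[:100]
X_val, y_val = X[100:], y[100:]

w = np.zeros(10)                       # small initial weights
eta = 0.01                             # learning rate (assumed)
best_val, best_w = np.inf, w.copy()
patience, wait = 20, 0                 # stop after 20 epochs without improvement

for epoch in range(1000):
    grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= eta * grad
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_val:
        best_val, best_w, wait = val_err, w.copy(), 0
    else:
        wait += 1
        if wait >= patience:           # validation error stopped improving
            break

w = best_w                             # keep weights from the best validation epoch
```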
Weight Decay. Another possibility for regularizing a neural network is weight decay.
Krogh and Hertz [128] proposed adding a term to the cost function E that penalizes
large weights:
E_d = E + ½λ Σ_k w_k² ,    (6.11)
where λ is a parameter that determines the strength of the penalty. If gradient descent
is used for learning, the penalty leads to a new term λw_k in the weight update:

Δw_k = −η ∂E_d/∂w_k = −η (∂E/∂w_k + λw_k) .    (6.12)
The new term would decay weights exponentially if no forces from the cost function
E were present.
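The update of Eq. (6.12) can be sketched in a few lines of Python. The function name decayed_update, the learning rate, and the decay strength are assumptions for illustration; with no gradient from E, each step multiplies the weights by (1 − ηλ), so they shrink exponentially as stated above.

```python
import numpy as np

eta, lam = 0.1, 0.01        # learning rate and decay strength (assumed values)

def decayed_update(w, grad_E):
    # One step of Eq. (6.12): Δw = −η (∂E/∂w + λ w)
    return w - eta * (grad_E + lam * w)

# Without any force from the cost function E (grad_E = 0), the weights decay
# exponentially: each step multiplies them by (1 − η·λ).
w = np.array([1.0, -2.0, 0.5])
for _ in range(5):
    w = decayed_update(w, grad_E=np.zeros_like(w))
```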
Low-Activity Prior. It is also possible to include terms in the cost function that
enforce properties of the representation present at hidden units. For instance, one
can force units to have a low average activity, e.g. α = 0.1:
E_a = E + ½λ Σ_k (⟨o_k⟩ − α)² ,    (6.13)
where ⟨o_k⟩ denotes the expected value of the activity of unit k. Gradient descent
yields the additional term λ(⟨o_k⟩ − α), which must be multiplied by the derivative
of the transfer function of unit k and added to its error component δ_k. A low-activity
prior for hidden units, combined with a force that produces variance, can yield sparse
representations.
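As a rough sketch of how such a penalty could enter backpropagation, the following Python fragment adds the term λ(⟨o_k⟩ − α), multiplied by the derivative of the transfer function, to the hidden-unit error components. The function name low_activity_deltas, the choice of a logistic transfer function, and the parameter values are assumptions, not taken from the text.

```python
import numpy as np

lam, alpha = 0.1, 0.1       # penalty strength and target mean activity (assumed values)

def low_activity_deltas(delta, net, o_mean):
    """Add the low-activity penalty term to the hidden-unit error components.

    delta   -- error components δ_k backpropagated from the layer above
    net     -- net inputs of the hidden units
    o_mean  -- ⟨o_k⟩: mean activity of each hidden unit over the batch
    """
    o = 1.0 / (1.0 + np.exp(-net))       # logistic transfer function (assumed)
    f_prime = o * (1.0 - o)              # its derivative
    return delta + lam * (o_mean - alpha) * f_prime

# Example use on a single pattern (all values are made up for illustration):
net = np.array([0.2, -1.0, 0.5])
o_mean = 1.0 / (1.0 + np.exp(-net))
print(low_activity_deltas(np.zeros_like(net), net, o_mean))
```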
6.3 Recurrent Neural Networks
So far, the function graph describing the neural network has been acyclic. If the graph of
primitive functions contains cycles, it is called a recurrent neural network (RNN). In