where $\mathbf{e}_{ij}$ is the unit vector in the weight space corresponding to the weight $w_{ij}$, and $w_{ij}$ is the weight connecting the $i$th input to the $j$th hidden unit.
To solve the minimization problem, we form the corresponding Lagrangian

$L = \frac{1}{2}\,\delta\mathbf{w}^{T}\mathbf{H}\,\delta\mathbf{w} + \lambda\left(\mathbf{e}_{ij}^{T}\,\delta\mathbf{w} + w_{ij}\right),$
where $\lambda$ is the Lagrange multiplier. The derivative of the Lagrangian with respect to $\delta\mathbf{w}$ and the constraint equation $\mathbf{e}_{ij}^{T}\,\delta\mathbf{w} + w_{ij} = 0$ define the optimal weight change

$\delta\mathbf{w} = -\,\frac{w_{ij}}{[\mathbf{H}^{-1}]_{ij}}\,\mathbf{H}^{-1}\mathbf{e}_{ij}.$
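In more detail, setting the derivative of the Lagrangian with respect to $\delta\mathbf{w}$ to zero gives

$\frac{\partial L}{\partial(\delta\mathbf{w})} = \mathbf{H}\,\delta\mathbf{w} + \lambda\,\mathbf{e}_{ij} = 0 \quad\Rightarrow\quad \delta\mathbf{w} = -\lambda\,\mathbf{H}^{-1}\mathbf{e}_{ij},$

and substituting this into the constraint $\mathbf{e}_{ij}^{T}\,\delta\mathbf{w} + w_{ij} = 0$ yields $\lambda = w_{ij}/[\mathbf{H}^{-1}]_{ij}$, from which the above expression for $\delta\mathbf{w}$ follows.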
Correspondingly, the related optimal value of the Lagrangian $L$ for the weight $w_{ij}$ is

$L_{ij} = \frac{w_{ij}^{2}}{2\,[\mathbf{H}^{-1}]_{ij}},$
where $[\mathbf{H}^{-1}]_{ij}$ is the diagonal element of the inverse Hessian matrix $\mathbf{H}^{-1}$ corresponding to the weight $w_{ij}$. The value $L_{ij}$ of the Lagrangian determined in this way represents the increase of mean square error caused by the removal of the weight $w_{ij}$ and is known as the saliency of the weight $w_{ij}$. It is obvious that, because the saliency depends on the square value of $w_{ij}$, small weight values have a low influence on the mean square error. However, because the saliency is inversely proportional to $[\mathbf{H}^{-1}]_{ij}$, small values of $[\mathbf{H}^{-1}]_{ij}$ can also have a strong influence on the mean square error.
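To make the use of these quantities concrete, the following sketch computes the saliencies from a given weight vector and inverse Hessian and carries out a single optimal-brain-surgeon pruning step. It is a minimal illustration, not an implementation from the text: it assumes the weights are flattened into a one-dimensional vector, the inverse Hessian is already available, and the function name obs_prune_step is chosen here only for the example.

import numpy as np

def obs_prune_step(w, H_inv):
    # Saliency of every weight: L_q = w_q^2 / (2 [H^-1]_qq),
    # i.e. the estimated increase of mean square error if weight q is removed.
    diag = np.diag(H_inv)
    saliency = w ** 2 / (2.0 * diag)
    # Remove the least salient weight and adjust the remaining weights:
    # delta_w = -(w_q / [H^-1]_qq) * H^-1 e_q, where H^-1 e_q is the q-th column of H^-1.
    q = int(np.argmin(saliency))
    delta_w = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    w_new = w + delta_w          # the pruned weight w_new[q] becomes (numerically) zero
    return w_new, q, float(saliency[q])

A full pruning run would repeat this step, recomputing or updating the inverse Hessian after each removal, until a stopping criterion such as the one discussed below is met.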
Although pruning methods such as optimal brain damage and optimal brain surgeon both rely on ranking the weights with respect to saliency, i.e. on the change in training error caused by pruning an individual weight, there is still an essential difference between them: the optimal brain surgeon procedure adjusts the remaining weights after removing a weight element and therefore does not require retraining of the network, whereas the optimal brain damage procedure requires this.
The disadvantage of both methods is that, if no stopping criterion is built in, the removal of the least significant weights can lead to network overfitting. As an efficient stopping criterion, the test error can be estimated using Akaike's (1970) final prediction error (FPE), or its modification that extends the estimate to the average generalization error of regularized networks (Moody, 1991).
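A possible realization of such a criterion is sketched below. It uses the classical FPE formula $\mathrm{FPE} = E_{\mathrm{train}}\,(N + p)/(N - p)$, with $N$ training samples and $p$ remaining weights (Moody's modification would replace $p$ by the effective number of parameters of the regularized network); the helper prune_least_salient_weight and the loop structure are assumptions made only for this illustration.

def fpe(train_mse, n_samples, n_params):
    # Akaike's final prediction error: an estimate of the test error
    # obtained from the training error and the model size.
    return train_mse * (n_samples + n_params) / (n_samples - n_params)

# Schematic pruning loop that stops when the FPE estimate starts to grow:
# prev_fpe = float("inf")
# while True:
#     train_mse, n_params = prune_least_salient_weight()   # assumed helper
#     current_fpe = fpe(train_mse, n_samples, n_params)
#     if current_fpe > prev_fpe:
#         break          # estimated generalization error increases: stop pruning
#     prev_fpe = current_fpe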
 