Overfitting and the Bias-Variance Dilemma
Since the accuracy of the uniform approximation of a given function by a
neural network increases as the number of hidden neurons increases, a naïve
design methodology would consist in building the network with as many neu-
rons as possible. However, as mentioned above, in real engineering problems,
the network is not required to approximate a known function uniformly, but to
approximate an unknown function (the regression function) from a finite num-
ber of experimental points (the training set); therefore, the network should
not only fit the experimental points as closely as possible (in the least squares
sense), but it should also generalize efficiently, i.e., give a satisfactory re-
sponse to situations that are not present in the training set. The difficulty
here is that there is no operational definition of the meaning of "satisfactory",
since the regression function is unknown: the problem of generalization is an
ill-posed problem. Therefore, the design problem is the following:
- If the neural network has too many parameters (it is said to be over-
  parameterized), it will be too "flexible," so that its output will fit very
  accurately all points of the training set (including the noise present in
  these points), but it will provide meaningless responses in situations that
  are not present in the training set. That is known as overfitting.
- By contrast, a neural network with too few parameters will not be complex
  enough to match the complexity of the (unknown) regression function, so
  that it will not be able to learn the training data.
This dilemma, known as the bias-variance dilemma, is the basic problem that
the model designer is faced with.
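The dilemma can be made concrete with a small numerical sketch. Here polynomials of increasing degree stand in for networks with increasing numbers of parameters (the sine target, noise level, and degrees are illustrative assumptions, not from the text): the over-parameterized model drives its training error down by fitting the noise, while the parsimonious model keeps training and validation errors comparable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of an "unknown" regression function (a sine, for illustration).
def sample(n, noise=0.2):
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, noise, n)

x_train, y_train = sample(15)   # small training set
x_val, y_val = sample(50)       # validation set: distinct observations

# Polynomial degree plays the role of the number of hidden neurons.
results = {}
for degree in (3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    val_mse = np.mean((y_val - np.polyval(coeffs, x_val)) ** 2)
    results[degree] = (train_mse, val_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, val MSE {val_mse:.4f}")
```

The degree-12 model necessarily achieves a lower training error than the degree-3 model, since its parameter set is richer; the validation error, computed on points the fit has never seen, is what reveals whether that extra flexibility helped or merely modeled the noise.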
Figure 1.14 shows the results obtained after training, on the same training
set, two different networks with sigmoid activation functions but different
numbers of hidden neurons (hence of parameters): clearly, the most
parsimonious model (i.e., the model with the smallest number of parameters)
generalizes best. In practice, the number of parameters should be small with
respect to the number of elements of the training set. The parsimony of neural
networks with sigmoid activation functions is a valuable asset in the design of
models that do not exhibit overfitting.
Figure 1.14 shows clearly which candidate neural network is most ap-
propriate. When the model has several inputs, the result cannot be exhib-
ited graphically in such a straightforward fashion: a quantitative performance
index must be defined. The most popular way of estimating such an index is
the following: in addition to the training set, one should build a validation
set, made of observations that are distinct from those of the training set, from
which a performance index is computed. The most frequently used criterion
is the mean square error on the validation set (VMSE), defined as:
\[
\mathrm{VMSE} = \frac{1}{N_V} \sum_{k=1}^{N_V} \left[ y_k - g(x_k, w) \right]^2
\]
where $N_V$ is the number of observations in the validation set.
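A minimal sketch of this computation follows. The one-hidden-neuron model `g` and the validation data are purely hypothetical placeholders; any model with the signature `g(x, w)` would do.

```python
import numpy as np

def vmse(y, x, g, w):
    """Mean square error of the model g(x, w) over a validation set (x, y)."""
    residuals = y - g(x, w)
    return float(np.mean(residuals ** 2))

# Hypothetical model, just to exercise the formula:
# g(x, w) = w[1] * tanh(w[0] * x)
def g(x, w):
    return w[1] * np.tanh(w[0] * x)

x_val = np.array([0.0, 0.5, 1.0])     # validation inputs (illustrative)
y_val = np.array([0.0, 0.45, 0.75])   # measured outputs (illustrative)
print(f"VMSE = {vmse(y_val, x_val, g, [1.0, 1.0]):.5f}")
```

By construction the VMSE is non-negative, and it vanishes exactly when the model reproduces every validation observation.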