incremental learning approaches that lead to similar results will also be discussed.
Still, the prototype system that is developed is only fully described from the batch
learning perspective. How to turn this system into an incremental learner is a topic
of future research.
3.2 LCS as Parametric Models
While the term model may be used in many different ways, it is here defined
as a collection of possible hypotheses about the data-generating process. Hence,
the choice of model determines which hypotheses are available and therefore
biases what can be expressed about this process. Such a bias represents the
assumptions that are made about the process and its stochasticity. Understanding
the assumptions that a model introduces allows for making statements about its
applicability and performance.
Example 3.3 (Different Linear Models and their Assumptions). Assuming a linear
relation between inputs and outputs with constant-variance Gaussian noise leads
to least squares linear regression, that is, linear regression using the L2 loss
function. Alternatively, assuming the noise to follow a Cauchy distribution
results in linear regression using the L1 loss function. As the Cauchy
distribution has heavier tails than the Gaussian, the resulting estimator is
more resilient to outliers and is hence considered more robust, but the L1 norm
makes the model harder to train [66]. This shows how the assumptions that a
model makes about the data-generating process can give us information about its
expected performance.
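This difference in robustness can be sketched with a small illustration (the data and the grid-search procedure below are illustrative choices, not taken from the text): a one-parameter linear model y ≈ wx is fitted under both loss functions to data containing a single outlier. The L2 estimate is pulled towards the outlier, while the L1 estimate is not.

```python
import numpy as np

# ten points on the line y = 2x, with the last output corrupted by an outlier
x = np.arange(1.0, 11.0)
y = 2.0 * x
y[-1] += 50.0

# the L2 (least squares) estimate of w has the closed form (x'y)/(x'x)
w_l2 = (x @ y) / (x @ x)

# the L1 estimate has no closed form; a coarse grid search suffices here
ws = np.linspace(0.0, 5.0, 5001)
w_l1 = ws[np.argmin([np.abs(y - w * x).sum() for w in ws])]

print(w_l2)  # pulled well away from the true slope 2 by the single outlier
print(w_l1)  # stays at the true slope 2
```

The grid search stands in for the iterative schemes that L1 regression requires in practice, which is precisely the training difficulty the example alludes to.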
Training a model means finding the hypothesis that is closest to what the data-
generating process is assumed to be. In a linear regression model, for example,
the space of hypotheses comprises all hyperplanes in the input/output space, and
performing linear regression means picking the hyperplane that best explains the
available observations.
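This hypothesis-picking step can be sketched in a few lines (the data below are made up for illustration): for least squares linear regression, the best hyperplane has a closed-form solution, and on noise-free observations it recovers the generating hyperplane exactly.

```python
import numpy as np

# observations generated by the hyperplane y = 1 + 3*x1 - 2*x2, without noise
x1 = np.linspace(0.0, 1.0, 20)
x2 = x1 ** 2
X = np.column_stack([np.ones_like(x1), x1, x2])  # bias term included
w_true = np.array([1.0, 3.0, -2.0])
y = X @ w_true

# least squares picks the hyperplane minimising the sum of squared residuals
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat)  # recovers the generating coefficients [1, 3, -2]
```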
The choice of model also strongly determines how hard it is to train. While more
complex models are usually able to express a larger range of possible hypotheses,
this larger range also makes it easier for them to overfit the data. Hence,
overfitting by minimising the empirical risk is very often counterbalanced by
reducing the number of hypotheses that a model can express, which makes the
assumptions that the model introduces all the more important.
Example 3.4 (Avoiding Overfitting in Artificial Neural Networks). Reducing the
number of hidden neurons in a feed-forward neural network is a popular measure
for avoiding overfitting the training data. It effectively reduces the number of
hypotheses that the model is able to express and thus introduces a stronger
structural bias. Another approach to avoiding overfitting in neural network
training is weight decay, which exponentially decays the magnitudes of the
connection weights during training. While not initially designed as such, weight
decay is equivalent to assuming a zero-mean Gaussian prior on the weights and
hence biases them towards smaller values. This prior is in turn equivalent to
assuming smoothness of the target function [106].
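The equivalence between weight decay and a zero-mean Gaussian prior can be sketched on a linear model, where it is easiest to verify: gradient descent with an added weight-decay term converges to the ridge regression solution, which is exactly the MAP estimate under such a prior. The data, the decay constant, and the learning rate below are illustrative choices, not values from the text.

```python
import numpy as np

# toy regression data (illustrative)
x1 = np.linspace(0.0, 1.0, 20)
X = np.column_stack([x1, x1 ** 2])
y = 3.0 * x1 - 2.0 * x1 ** 2
n = len(y)

lam = 0.1  # decay constant; plays the role of the Gaussian prior's precision

# MAP solution under a zero-mean Gaussian prior (ridge regression)
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(2), X.T @ y / n)

# gradient descent on the squared error, with weight decay pulling w to zero
w = np.zeros(2)
lr = 0.3
for _ in range(20000):
    w -= lr * (X.T @ (X @ w - y) / n + lam * w)

# both routes arrive at the same, shrunken weight vector
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                                                 # matches w_ridge
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))   # prior shrinks weights
```

The comparison with the unregularised least squares weights w_ols makes the bias visible: the prior shrinks the weight vector towards zero, which is the smoothness assumption mentioned above.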
 