noisy measurements, the data is almost certainly not completely correct. Hence,
we want to find a model that represents the general pattern in the training data
but does not model its noise. The field that deals with this issue is known as
model selection. Learning a model such that it perfectly fits the training set but
does not provide a good representation of f is known as overfitting. The opposite,
that is, learning a model where the structural bias of the model dominates
over the information included from the training set, is called underfitting.
While in LCS several heuristics have been applied to deal with this issue, it
has never been characterised explicitly. In this and the following chapters the
aim is considered to be the minimisation of the empirical risk. In Chap. 7, we
return to the topic of model selection, and show how it can be handled with
respect to LCS in a principled manner.
[Figure 3.1. Panel (a): f(x) against x over [0, 1], showing the observed f(x), the real f(x), and the 1st-, 2nd-, 4th-, and 10th-order fits. Panel (b): empirical risk and expected risk against the degree of the polynomial, from 0 to 10.]
Fig. 3.1. Comparing the fit of polynomials of various degrees to 100 noisy observations
of a 2nd-order polynomial. (a) shows the data-generating function, the available
observations, and the least-squares fit of polynomials of degree 1, 2, 4, and 10. (b) shows
how the expected and empirical risk change with the degree of the polynomial. More
information is given in Example 3.1.
Example 3.1 (Expected and Empirical Risk of Fitting Polynomials of Various
Degree). Consider the data-generating function $f(x) = 1/3 - x/2 + x^2$, whose
observations, taken over the range $x \in [0, 1]$, are perturbed by Gaussian noise
with a standard deviation of 0.1. Assuming no knowledge of $f(x)$, and given
only its observations, let us hypothesise that the data was indeed generated by
a polynomial of some degree $d$, as described by the model

$$f_d(x; \theta) = \sum_{n=0}^{d} \theta_n x^n, \tag{3.3}$$

where $\theta \in \mathbb{R}^{d+1}$ is the parameter vector of that model. The aim is to find the
degree $d$ that best describes the given observations.
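The setup of Example 3.1 can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch under several assumptions that the text does not state: the inputs are drawn uniformly at random from [0, 1], the random seed is arbitrary, the risk is the mean squared error, and the expected risk is approximated by comparing each fitted polynomial with the noise-free function on a dense grid. The names `fit_and_score` and `risks` are hypothetical helpers introduced here.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)  # arbitrary seed (assumption)


def f(x):
    """Data-generating function of Example 3.1: 1/3 - x/2 + x^2."""
    return 1.0 / 3.0 - x / 2.0 + x ** 2


# 100 observations over [0, 1], perturbed by Gaussian noise with std 0.1.
# Drawing the inputs uniformly at random is an assumption of this sketch.
N, NOISE_STD = 100, 0.1
x_train = rng.uniform(0.0, 1.0, N)
y_train = f(x_train) + rng.normal(0.0, NOISE_STD, N)

# Dense grid used to approximate the expected risk against the true f.
x_grid = np.linspace(0.0, 1.0, 10_001)


def fit_and_score(d):
    """Least-squares fit of a degree-d polynomial; returns both risks."""
    theta = P.polyfit(x_train, y_train, d)  # theta[n] multiplies x**n
    empirical = np.mean((P.polyval(x_train, theta) - y_train) ** 2)
    expected = np.mean((P.polyval(x_grid, theta) - f(x_grid)) ** 2)
    return empirical, expected


# Map each degree d in 0..10 to (empirical risk, approximate expected risk).
risks = {d: fit_and_score(d) for d in range(11)}

for d, (empirical, expected) in risks.items():
    print(f"d={d:2d}  empirical={empirical:.5f}  expected={expected:.5f}")
```

Because the polynomial models are nested, the empirical risk cannot increase with the degree, so on its own it always favours the most flexible model. The approximate expected risk is the quantity that reveals whether the extra flexibility is spent on modelling noise.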
 