The true function $f(x)$ and the given observations are shown in Fig. 3.1(a), together with fitted polynomials of degree 1, 2, 4, and 10, using the loss function $L(y, \hat{y}) = (y - \hat{y})^2$. The 1st-degree polynomial $f_1$ (that is, the straight line) clearly underfits the data. This is confirmed by its high expected and empirical risk when compared to the other models, as shown in Fig. 3.1(b). On the other hand, the 2nd-degree polynomial $f_2$, which conforms to the true data-generating model, represents the data well and is close to $f(x)$ (but not equivalent, due to the finite number of observations). Still, having no knowledge of $f(x)$, one has no reason to stop at $d = 2$, particularly when observing in Fig. 3.1(b) that increasing $d$ reduces the empirical risk further. The expected risk, however, rises, which indicates that the models start to overfit the data by modelling its noise. This is clearly visible in the fit of $f_{10}$ in Fig. 3.1(a), which is closer to the observations than $f_2$, but further away from $f$.
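This experiment is easy to reproduce. Below is a minimal sketch, not taken from the text: the quadratic stand-in for $f(x)$, the noise level, and the sample size are all assumptions, chosen only to make the under/overfitting pattern visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-in for the true function f(x): a 2nd-degree polynomial.
f = lambda x: 1.0 - 2.0 * x + 0.5 * x**2

# N noisy observations y_n = f(x_n) + Gaussian noise.
N = 15
x = rng.uniform(-1.0, 3.0, N)
y = f(x) + rng.normal(0.0, 0.3, N)

for d in (1, 2, 4, 10):
    # Least-squares fit of a degree-d polynomial, i.e. minimising the
    # empirical risk under the squared loss L(y, y_hat) = (y - y_hat)^2.
    f_d = np.poly1d(np.polyfit(x, y, d))

    emp_risk = np.mean((y - f_d(x)) ** 2)   # empirical risk on the data
    # Squared distance to the true f over a dense grid: the expected
    # risk minus the irreducible noise term.
    x_test = np.linspace(-1.0, 3.0, 500)
    dist_to_f = np.mean((f(x_test) - f_d(x_test)) ** 2)
    print(f"d = {d:2d}: empirical risk = {emp_risk:.4f}, "
          f"distance to f = {dist_to_f:.4f}")
```

Running this, one typically sees the empirical risk fall monotonically with $d$ while the distance to $f$ is smallest around $d = 2$, mirroring Fig. 3.1(b).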
The trend of the expected and the empirical risk in Fig. 3.1(b) is a common one: an increase in model complexity (represented in our case by $d$) generally causes a decrease in the empirical risk. The expected risk, however, only decreases up to a certain model complexity, beyond which it starts to increase as the model overfits the data. Thus, the aim is to identify the model that minimises the expected risk, which is complicated by the fact that this risk measure is usually not directly accessible. One needs to resort to the empirical risk in combination with some measure of the complexity of the model, and finding such a complexity measure is what makes identifying the best model a non-trivial problem.
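The text develops its own answer to this problem later; purely as an illustration of one common practical stand-in for the inaccessible expected risk, the following sketch estimates it by k-fold cross-validation. The data arrays `x` and `y` are hypothetical, and cross-validation is a standard technique, not the method of this text.

```python
import numpy as np

def cv_risk(x, y, d, k=5):
    """Estimate the expected risk of a degree-d polynomial fit by
    k-fold cross-validation: average squared loss on held-out folds."""
    folds = np.array_split(np.random.permutation(len(x)), k)
    losses = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)
        f_d = np.poly1d(np.polyfit(x[train], y[train], d))
        # Squared loss on data the model has not seen during fitting.
        losses.append(np.mean((y[fold] - f_d(x[fold])) ** 2))
    return np.mean(losses)

# Hypothetical usage: pick the degree with the lowest estimated risk.
# best_d = min(range(1, 11), key=lambda d: cv_risk(x, y, d))
```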
3.1.2 Regression
Both regression and classification tasks aim at finding a hypothesis for the data-generating process such that some risk measure is minimised, but they differ in the nature of the input and output spaces. A regression task is characterised by a multidimensional real-valued input space $\mathcal{X} = \mathbb{R}^{D_X}$ with $D_X$ dimensions and a multidimensional real-valued output space $\mathcal{Y} = \mathbb{R}^{D_Y}$ with $D_Y$ dimensions. Thus, the inputs are column vectors $\mathbf{x} = (x_1, \dots, x_{D_X})^T$ and the corresponding outputs are column vectors $\mathbf{y} = (y_1, \dots, y_{D_Y})^T$. In the case of batch learning it is assumed that $N$ observations $(\mathbf{x}_n, \mathbf{y}_n)$ are available in the form of the input matrix $\mathbf{X}$ and output matrix $\mathbf{Y}$,

$$
\mathbf{X} = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_N^T \end{pmatrix},
\qquad
\mathbf{Y} = \begin{pmatrix} \mathbf{y}_1^T \\ \vdots \\ \mathbf{y}_N^T \end{pmatrix}.
\tag{3.4}
$$
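In code, the arrangement in (3.4) amounts to stacking each observation as a row; a minimal sketch with hypothetical dimensions and random stand-in data:

```python
import numpy as np

N, D_X, D_Y = 100, 3, 2   # hypothetical sizes
rng = np.random.default_rng(0)
observations = [(rng.random(D_X), rng.random(D_Y)) for _ in range(N)]

# Each observation pair (x_n, y_n) contributes one row, matching
# the row-wise stacking of transposed column vectors in (3.4).
X = np.vstack([x_n for x_n, _ in observations])   # shape (N, D_X)
Y = np.vstack([y_n for _, y_n in observations])   # shape (N, D_Y)
```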
The loss function is commonly the $L_2$ norm, also known as the Euclidean distance, and is defined by $L_2(\mathbf{y}, \hat{\mathbf{y}}) \equiv \|\mathbf{y} - \hat{\mathbf{y}}\|_2 = \left( \sum_i (y_i - \hat{y}_i)^2 \right)^{1/2}$. Hence, the loss increases quadratically in all dimensions with the distance from the desired value. Alternatively, the $L_1$ norm, also known as the absolute distance and defined by $L_1(\mathbf{y}, \hat{\mathbf{y}}) \equiv \|\mathbf{y} - \hat{\mathbf{y}}\|_1 = \sum_i |y_i - \hat{y}_i|$, can be used, in which case the loss increases linearly with the distance in each dimension.
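As a minimal sketch, both losses are one-liners in NumPy (the $L_1$ form matches the standard definition given above):

```python
import numpy as np

def l2_loss(y, y_hat):
    # L2 norm: per-dimension deviations enter squared.
    return np.sqrt(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))

def l1_loss(y, y_hat):
    # L1 norm: per-dimension deviations enter as absolute values.
    return np.sum(np.abs(np.asarray(y) - np.asarray(y_hat)))
```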