Smoothness of the function is required to express that the process generates
similar outputs for similar inputs. That is, given two inputs $x, x'$ that are close in
$\mathcal{X}$, their associated outputs $y, y'$ need, on average, to be close in $\mathcal{Y}$. This property
is required in order to make predictions: if it did not hold, then we could not
generalise over the training data, as relations between inputs would not transfer to
relations between outputs, and thus we would be unable to predict the output
for an input that is not in the training set. There are several ways of ensuring
the smoothness of a function, such as by limiting its energy at high frequencies
in the frequency domain [94]. Here, smoothness is dealt with from an intuitive
perspective rather than in any formal way.
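As an informal illustration of this frequency-domain view, the following Python sketch compares the fraction of spectral energy above a cutoff frequency for a slowly varying function and for a jagged one. The concrete functions, grid, and cutoff are arbitrary choices made only for this example and are not taken from [94].

    import numpy as np

    def high_freq_energy_fraction(values, cutoff=0.25):
        # Fraction of the spectral energy above cutoff * (Nyquist frequency).
        spectrum = np.abs(np.fft.rfft(values)) ** 2
        k = int(cutoff * len(spectrum))
        return spectrum[k:].sum() / spectrum.sum()

    x = np.linspace(0.0, 1.0, 512)
    smooth = np.sin(2 * np.pi * x)                           # slowly varying
    jagged = smooth + 0.5 * np.sign(np.sin(60 * np.pi * x))  # rapid jumps

    print(high_freq_energy_fraction(smooth))   # close to zero
    print(high_freq_energy_fraction(jagged))   # noticeably larger

A smooth function concentrates its energy at low frequencies, which is the informal sense in which nearby inputs map to nearby outputs.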
As discussed before, the process may be stochastic and the measurements of
the output may be noisy. This stochasticity is modelled by a zero-mean random
variable, such that for an observation $(x, y)$ we have $\mathbb{E}_Y(y) = f(x)$. The distribution
of this random variable is determined by the process stochasticity and the
measurement noise.
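To make the role of the zero-mean noise concrete, the following Python sketch draws repeated observations at a single input; the particular choice of $f$ and of Gaussian noise is an assumption made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # Stand-in for the unknown data-generating function.
        return np.sin(2 * np.pi * x)

    def observe(x, noise_std=0.3, n_samples=10_000):
        # Noisy observations y = f(x) + noise, where the noise has zero mean.
        return f(x) + rng.normal(0.0, noise_std, size=n_samples)

    x0 = 0.1
    samples = observe(x0)
    print(f(x0), samples.mean())   # the sample mean approaches f(x0)

Averaging many observations at the same input recovers $f(x)$, which is exactly what the zero-mean assumption expresses.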
With this formulation, a model with structure $\mathcal{M}$ has to provide a hypothesis
of the form $f_\mathcal{M} : \mathcal{X} \to \mathcal{Y}$. In order to be a good model, $f_\mathcal{M}$ has to be close to
$f$. To be more specific, let $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ be a loss function that describes
a distance metric in $\mathcal{Y}$, that is, $L(y, y') > 0$ for all $y \neq y'$, and $L(y, y') = 0$
otherwise. To get a hypothesis $f_\mathcal{M}$ close to $f$ we want to minimise the expected
risk
$$\int_{\mathcal{X}} L\left(f(x), f_\mathcal{M}(x)\right) \mathrm{d}p(x), \qquad (3.1)$$
where $p(x)$ is the probability density of having input $x$. In other words, our aim
is to minimise the distance between the output of the data-generating process
and our model of it, for each input $x$ weighted by the probability of observing
it.
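As a sketch of what (3.1) measures, the expected risk can be approximated by Monte Carlo sampling once $f$, a candidate hypothesis, the loss, and $p(x)$ are fixed. All of the concrete choices below (squared-error loss, uniform inputs on $[0, 1]$, a linear hypothesis) are illustrative assumptions rather than part of the formulation above.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):                      # data-generating function (illustrative)
        return np.sin(2 * np.pi * x)

    def f_M(x):                    # a candidate hypothesis (illustrative)
        return 1.0 - 2.0 * x

    def loss(y, y_pred):           # squared-error loss as the distance in Y
        return (y - y_pred) ** 2

    def expected_risk(n_samples=100_000):
        # Monte Carlo approximation of the integral in (3.1) for uniform p(x).
        x = rng.uniform(0.0, 1.0, size=n_samples)
        return loss(f(x), f_M(x)).mean()

    print(expected_risk())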
The expected risk cannot be minimised directly, as $f$ is only accessible through a
finite set of observations. Thus, when constructing the model one needs to rely
on an approximation of the expected risk, called the empirical risk and defined
as
$$\frac{1}{N} \sum_{n=1}^{N} L\left(y_n, f_\mathcal{M}(x_n)\right), \qquad (3.2)$$
which is the average loss of the model over all available observations. Depending
on the definition of the loss function, minimising the empirical risk can result
in least squares learning or the principle of maximum likelihood [218]. By the
law of large numbers, the empirical risk converges to the expected risk almost
surely as the number of observations tends to infinity, but for a small set
of observations the two measures might be quite different. How to minimise the
expected risk based on the empirical risk forms the basis of statistical learning
theory, for which Vapnik has written a good introduction with slightly different
definitions [218].
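The empirical risk (3.2) is straightforward to compute once the loss is fixed. The following sketch uses the squared-error loss, under which minimising the empirical risk amounts to least squares learning, as noted above; the synthetic data set and the linear hypothesis are assumptions made only for this example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic observations (x_n, y_n) from a noisy process (illustrative).
    N = 200
    x = rng.uniform(0.0, 1.0, size=N)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)

    def f_M(x, w0, w1):
        # A simple parametric hypothesis: a straight line.
        return w0 + w1 * x

    def empirical_risk(w0, w1):
        # (1/N) * sum_n L(y_n, f_M(x_n)) with squared-error loss.
        return np.mean((y - f_M(x, w0, w1)) ** 2)

    # Least squares finds the line parameters that minimise this empirical risk.
    w1_hat, w0_hat = np.polyfit(x, y, deg=1)
    print(empirical_risk(w0_hat, w1_hat))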
We could simply proceed by minimising the empirical risk. That this approach
will not lead to an adequate result is shown by the following observation: the
model that minimises the empirical risk is the training set itself. However, assuming
 