the actual outputs are Y, is L(Y, f(X, R)). Given particular values y and x, we have
the empirical loss L(y, f(x, R)), or L̂(R) for short.⁴
Now, a natural impulse at this point is to twist the knobs to make the loss
small: i.e., to select the R that minimizes L̂(R); let's write this as follows:
R̂ = argmin_R L̂(R). This procedure is sometimes called empirical risk minimization,
or ERM. (Of course, doing that minimization can itself be a tricky nonlinear
problem, but I will not cover optimization methods here.) The problem with
ERM is that the R̂ we get from this data will almost surely not be the same as
the one we'd get from the next set of data. What we really care about, if we think
it through, is not the error on any particular set of data, but the error we can
expect on new data, E[L(R)]. The former, L̂(R), is called the training or in-sample
or empirical error; the latter, E[L(R)], the generalization or out-of-sample or
true error. The difference between in-sample and out-of-sample errors is due to
sampling noise, the fact that our data are not perfectly representative of the
system we're studying. There will be quirks in our data which are just due to
chance, but if we minimize L̂(R) blindly, if we try to reproduce every feature of
the data, we will be making a machine that reproduces the random quirks, which do
not generalize, along with the predictive features. Think of the empirical error
L̂(R) as the generalization error, E[L(R)], plus a sampling fluctuation, F. If we
look at machines with low empirical errors, we will pick out ones with low true
errors, which is good, but we will also pick out ones with large negative
sampling fluctuations, which is not good. Even if the sampling noise F is very
small, R̂ can be very different from R_min. We have what optimization theory
calls an ill-posed problem (22).
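The ERM recipe — pick the R that minimizes L̂(R) on the data at hand — can be sketched in a few lines of code. This is an illustrative sketch, not from the text: it assumes a squared-error loss, a simple linear machine f(x, R) = R₀ + R₁x, and a brute-force grid search standing in for a real optimization method; all names and parameter values are hypothetical.

```python
import numpy as np

def empirical_loss(R, x, y):
    """Empirical loss L-hat(R): mean squared error of the
    linear machine f(x, R) = R[0] + R[1]*x on the data (x, y)."""
    predictions = R[0] + R[1] * x
    return np.mean((y - predictions) ** 2)

def erm_fit(x, y, candidates):
    """Empirical risk minimization: return the candidate R with the
    smallest empirical loss (argmin over a finite set of machines)."""
    losses = [empirical_loss(R, x, y) for R in candidates]
    return candidates[int(np.argmin(losses))]

# Illustrative data: a noisy line y = 1 + 2x + noise (assumed system).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 1 + 2 * x + rng.normal(0, 0.1, size=50)

# Grid of candidate machines R = (intercept, slope).
grid = [(a, b) for a in np.linspace(0, 2, 21) for b in np.linspace(0, 4, 41)]
R_hat = erm_fit(x, y, grid)
```

For this particular loss and machine the minimizer could of course be found in closed form (ordinary least squares); the grid search is only there to make the argmin over machines explicit.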
Having a higher-than-optimal generalization error because we paid too
much attention to our data is called over-fitting . Just as we are often better off if
we tactfully ignore our friends' and neighbors' little faults, we want to ignore the
unrepresentative blemishes of our sample. Much of the theory of data mining is
about avoiding over-fitting. Three of the commonest forms of tact it has
developed are, in order of sophistication, cross-validation, regularization
(or penalization), and capacity control.
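Over-fitting can be made concrete with a small simulation comparing empirical error to error on fresh data. Everything here is assumed for illustration: the "true system" y = sin(2πx) plus noise, the sample sizes, and polynomial fits of increasing degree as the family of machines, with a large fresh sample standing in as a proxy for the expectation E[L(R)].

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_data(n):
    """Noisy samples from an assumed true system y = sin(2*pi*x) + noise."""
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=n)

x_train, y_train = sample_data(15)    # the data we fit to
x_new, y_new = sample_data(2000)      # fresh data: proxy for E[L(R)]

train_err, new_err = {}, {}
for degree in (1, 3, 12):
    # ERM within this model class: least-squares polynomial fit.
    R = np.polyfit(x_train, y_train, degree)
    train_err[degree] = np.mean((y_train - np.polyval(R, x_train)) ** 2)
    new_err[degree] = np.mean((y_new - np.polyval(R, x_new)) ** 2)
```

With only 15 training points, the degree-12 polynomial nearly interpolates the sample, so its empirical error is tiny, while its error on fresh data is far larger: the machine has reproduced the sampling quirks along with the predictive features.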
2.1.1. Validation
We would never over-fit if we knew how well our machine's predictions
would generalize to new data. Since our data are never perfectly representative,
we always have to estimate the generalization performance. The empirical error
provides one estimate, but it's biased towards saying that the machine will do
well (since we built it to do well on that data). If we had a second, independent
set of data, we could evaluate our machine's predictions on it, and that would
give us an unbiased estimate of its generalization. One way to do this is to take
our original data and divide it, at random, into two parts, the training set and the