test set or validation set. We then use the training set to fit the machine, and
evaluate its performance on the test set. (This is an instance of resampling our
data, which is a useful trick in many contexts.) Because we've made sure the test
set is independent of the training set, we get an unbiased estimate of the out-of-
sample performance.
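To make this concrete, here is a minimal sketch in Python with NumPy. The sine-curve data, the 200-point sample, the 150/50 split, and the cubic polynomial standing in for the learning machine are all invented for illustration, not taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: a noisy curve we would like to learn.
    x = rng.uniform(0.0, 1.0, size=200)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

    # Randomly split the data into a training set and a held-out test set.
    idx = rng.permutation(x.size)
    train, test = idx[:150], idx[150:]

    # Fit the "machine" (here a cubic polynomial) on the training set only.
    coeffs = np.polyfit(x[train], y[train], deg=3)

    # Score it on the independent test set; because the test set played no
    # part in the fitting, this estimates the out-of-sample error.
    test_mse = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"held-out test MSE: {test_mse:.3f}")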
In cross-validation, we divide our data into random training and test sets
many different ways, fit a different machine for each training set, and compare
their performances on their test sets, taking the one with the best test-set per-
formance. This reintroduces some bias—it could happen by chance that one test
set reproduces the sampling quirks of its training set, favoring the model fit to
the latter. But cross-validation generally reduces over-fitting, compared to sim-
ply minimizing the empirical error; it makes more efficient use of the data,
though it cannot get rid of sampling noise altogether.
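A sketch of one version of this procedure, continuing the toy example above: the data are divided into random training and test sets several times, a different machine (here, a polynomial of a different degree) is fit to each training set, and the candidate with the best average test-set performance is kept. The split sizes and the range of degrees are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # The same kind of invented data as before.
    x = rng.uniform(0.0, 1.0, size=200)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

    def cv_score(degree, n_splits=5):
        """Average held-out MSE for a polynomial of the given degree,
        over several random training/test divisions of the data."""
        scores = []
        for _ in range(n_splits):
            idx = rng.permutation(x.size)
            train, test = idx[:150], idx[150:]
            coeffs = np.polyfit(x[train], y[train], deg=degree)
            scores.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
        return np.mean(scores)

    # Keep the candidate machine with the best average test-set performance.
    degrees = range(1, 10)
    best = min(degrees, key=cv_score)
    print("degree chosen by cross-validation:", best)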
2.1.2. Regularization or Penalization
I said that the problem of minimizing the error is ill-posed, meaning that
small changes in the errors can lead to big changes in the optimal parameters. A
standard approach to ill-posed problems in optimization theory is called regu-
larization. Rather than trying to minimize L̂(R) alone, we minimize

    L̂(R) + M d(R),    [1]

where d(R) is a regularizing or penalty function. Remember that L̂(R) = E[L(R)]
+ F, where F is the sampling noise. If the penalty term is well-designed, then the
R which minimizes

    E[L(R)] + F + M d(R)    [2]

will be close to the R that minimizes E[L(R)]; it will cancel out the effects of
favorable fluctuations. As we acquire more and more data, F → 0, so M, too, goes
to zero at an appropriate pace, and the penalized solution will converge on the
machine with the best possible generalization error.
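One concrete and widely used instance of this scheme is ridge regression for linear machines, where the penalty d(R) is the squared length of the parameter vector and the penalized criterion has a closed-form minimizer. The sketch below is illustrative only: the data are invented, and the 1/n schedule for the penalty weight M simply stands in for the requirement that M shrink as more data are acquired.

    import numpy as np

    rng = np.random.default_rng(2)

    def ridge_fit(X, y, penalty):
        # Minimize the empirical squared error plus penalty * ||w||^2,
        # i.e. a quadratic penalty d(w); this is the ridge-regression solution.
        n_features = X.shape[1]
        return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

    # An over-parameterized design: many columns, few of them informative.
    n, p = 50, 20
    X = rng.normal(size=(n, p))
    true_w = np.zeros(p)
    true_w[:3] = [2.0, -1.0, 0.5]
    y = X @ true_w + rng.normal(scale=0.5, size=n)

    # Let the penalty weight shrink as more data come in (here M = 1/n), so the
    # penalized solution approaches the unpenalized optimum in the limit.
    penalty = 1.0 / n
    w_hat = ridge_fit(X, y, penalty)
    print("penalized estimates of the first coefficients:", np.round(w_hat[:3], 2))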
How then should we design penalty functions? The more knobs and dials
there are on our machine, the more opportunities we have to get into mischief by
matching chance quirks in the data. If one machine has fifty knobs and another
fits the data just as well but has only a single knob, we should (the story goes)
choose the latter: because it is less flexible, the fact that it does well is a good in-
dication that it will still do well in the future. There are thus many regularization
methods that add a penalty proportional to the number of knobs, or, more for-
mally, the number of parameters. These include the Akaike information crite-
rion or AIC (23) and the Bayesian information criterion or BIC (24,25). Other