should not be multiplied beyond necessity," or, "the model should be no rougher
than necessary." This takes complexity to be a property of an individual model,
and the hope is that a simple model that can predict the training data will also be
able to predict new data. Under many circumstances, one can prove that, as the
size of a sample approaches infinity, regularization will converge on the correct
model, the one with the best generalization performance (26). But one can often
prove exactly the same thing about ERM without any regularization or penaliza-
tion at all; this is what the VC bounds [5] accomplish. While regularization
methods often do well in practice, so, too, does straight ERM. If we compare the
performance of regularization methods to straight empirical error minimization
on artificial examples, where we can calculate the generalization performance
exactly, regularization sometimes conveys no clear advantage at all (45).
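As a concrete illustration of the kind of comparison meant here, the following sketch (an addition to this excerpt, not the authors' own experiment) fits a linear model to artificial data twice, once by straight empirical risk minimization (ordinary least squares) and once with a ridge penalty, then estimates each fit's generalization error on a very large independent test sample; on well-behaved problems like this one, the two often come out nearly even.

    import numpy as np

    rng = np.random.default_rng(0)

    # Artificial data: y = X @ w_true + noise, so generalization error can be
    # estimated to high accuracy on a very large held-out sample.
    n_train, n_test, d = 100, 100_000, 10
    w_true = rng.normal(size=d)

    def make_data(n):
        X = rng.normal(size=(n, d))
        return X, X @ w_true + rng.normal(scale=1.0, size=n)

    X_tr, y_tr = make_data(n_train)
    X_te, y_te = make_data(n_test)

    # Straight ERM: ordinary least squares, no penalty.
    w_erm = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

    # Regularized fit: ridge regression with penalty strength lam.
    lam = 1.0
    w_ridge = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

    def gen_error(w):
        # Mean squared error on the held-out sample approximates the true risk.
        return np.mean((X_te @ w - y_te) ** 2)

    print("ERM   generalization MSE:", gen_error(w_erm))
    print("ridge generalization MSE:", gen_error(w_ridge))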
Contrast this with what happens in structural risk minimization. There our
complexity penalty depends solely on the VC dimension of the class of models
we're using. A simple, inflexible model which we find only because we're look-
ing at a complex, flexible class is penalized just as much as the most wiggly
member of that class. Experimentally, SRM does work better than simple ERM,
or than traditional penalization methods.
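To make the contrast concrete, here is a minimal sketch (added to this excerpt, and using one standard textbook form of the VC confidence term rather than the bound cited above) of how SRM scores candidate model classes: each class's empirical risk is inflated by a penalty that depends only on the VC dimension of the class, and the class with the smallest resulting bound is chosen.

    from math import log, sqrt

    def vc_penalty(h, n, delta=0.05):
        # One common textbook form of the VC confidence term (an assumption
        # here; the equation the chapter cites is not reproduced in this
        # excerpt). With probability at least 1 - delta, the true risk is at
        # most the empirical risk plus this quantity.
        return sqrt((h * (log(2 * n / h) + 1) + log(4 / delta)) / n)

    def srm_select(candidates, n):
        # candidates: (name, empirical_risk, vc_dimension) triples.
        # SRM minimizes the guaranteed bound, not the training error alone.
        return min((emp + vc_penalty(h, n), name) for name, emp, h in candidates)

    # Hypothetical classes: the flexible ones fit the training data better,
    # but their larger VC dimension inflates the bound.
    classes = [("linear", 0.12, 5), ("quadratic", 0.10, 20), ("spline", 0.05, 200)]
    bound, chosen = srm_select(classes, n=1000)
    print(chosen, "selected, with risk bound about", round(bound, 3))

With these made-up numbers the least flexible class wins, even though it has the largest training error.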
A simple example may help illuminate why this is so. Suppose we're inter-
ested in binary classification, and we find a machine R that correctly classifies a
million independent data points. If the real error rate (= generalization error) for
R was one in a hundred thousand, the chance that it would correctly classify a
million data points would be (0.99999)^(10^6) ≈ 4.5 × 10^-5. If R was the very
first parameter setting we checked, we could be quite confident that its true
error rate was much less than 10^-5, no matter how complicated the function
f(X, R) looked.
But if we've looked at ten million parameter settings before finding R, then the
odds are quite good that, among the machines with an error rate of 10^-5, we'd
find several that correctly classify all the points in the training set, so the fact
that R does is not good evidence that it's the best machine. 9 What matters is not
how much algebra is involved in making the predictions once we've chosen R,
but how many alternatives to R we've tried out and rejected. The VC dimension
lets us apply this kind of reasoning rigorously and without needing to know the
details of the process by which we generate and evaluate models.
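The arithmetic in this example is easy to check; the short computation below (an addition to this excerpt, using the figures quoted in the text) reproduces the chance that a single machine with true error rate 10^-5 survives a million test points, and the expected number of such lucky survivors if, say, ten million candidate settings all had that error rate.

    # Chance that one machine with true error rate 1e-5 correctly classifies
    # a million independent points: (1 - 1e-5) ** 1e6, about 4.5e-5.
    p_survive = (1 - 1e-5) ** 1_000_000
    print(f"single-machine survival probability: {p_survive:.2e}")

    # Searching ten million parameter settings: if (hypothetically) each had
    # a true error rate of 1e-5, the expected number that would nonetheless
    # classify every training point correctly is in the hundreds.
    n_settings = 10_000_000
    print("expected lucky survivors:", round(n_settings * p_survive))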
The upshot is that the kind of complexity which matters for learning, and so
for Occam's Razor, is the complexity of classes of models , not of individual
models nor of the system being modeled. It is important to keep this point in
mind when we try to measure the complexity of systems (§8). 10
2.4. Relation of Complex Systems Science to Statistics
Complex systems scientists often regard the field of statistics as irrelevant
to understanding such systems. This is understandable, since the exposure most