empirical errors are close to their true values, just that the smallest empirical
error is close to the smallest generalization error.
Recall that the sampling error is the gap $\varepsilon = \hat{L}(R) - \mathrm{E}[L(R)]$ between the empirical error and the generalization error. It is natural to assume that as our sample size N becomes larger, our sampling error $\varepsilon$ will approach zero. (We will return to this assumption below.) Suppose we could find a function $\eta(N)$ to bound our sampling error, such that $|\varepsilon| \leq \eta(N)$. Then we could guarantee that our choice of model was approximately correct; if we wanted to be sure that our prediction errors were within $\varepsilon$ of the best possible, we would merely need to have $N(\varepsilon) = \eta^{-1}(\varepsilon)$ data-points.
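To make the inversion concrete, here is a minimal sketch, assuming a Hoeffding-type bound (our illustration, not the text's): for a loss bounded in [0, 1], with probability at least $1 - \alpha$ we have $|\varepsilon| \leq \eta(N) = \sqrt{\ln(2/\alpha)/(2N)}$, which inverts to $N(\varepsilon) = \ln(2/\alpha)/(2\varepsilon^2)$. The function names are our own.

    import math

    # Assumed Hoeffding-type bound for a loss in [0, 1], confidence 1 - alpha:
    # |sampling error| <= eta(N) = sqrt(ln(2/alpha) / (2N)).

    def eta(n, alpha=0.05):
        """Bound on the sampling error after n samples."""
        return math.sqrt(math.log(2 / alpha) / (2 * n))

    def n_required(eps, alpha=0.05):
        """Invert eta: the sample size N(eps) that drives the bound below eps."""
        return math.ceil(math.log(2 / alpha) / (2 * eps ** 2))

    print(eta(10_000))       # about 0.0136 after 10,000 samples
    print(n_required(0.01))  # 18445 samples guarantee eps = 0.01

Note that under such a bound $N(\varepsilon)$ grows like $1/\varepsilon^2$: each extra decimal place of accuracy costs a hundredfold more data.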
It should not be surprising to learn that we cannot, generally, make ap-
proximately correct guarantees. As the eminent forensic statistician C. Chan
remarked, "Improbable events permit themselves the luxury of occurring" (27),
and one of these indulgences could make the discrepancy between
$\hat{L}(R)$ and $\mathrm{E}[L(R)]$ very large. But if something like the law of large numbers holds, or the
ergodic theorem (§3.2), then for every choice of R,
$$\Pr\Bigl(\bigl|\hat{L}(R) - \mathrm{E}[L(R)]\bigr| > \varepsilon\Bigr) \to 0, \qquad [3]$$
for every positive $\varepsilon$. 5 We should be able to find some function $\delta$ such that
$$\Pr\Bigl(\bigl|\hat{L}(R) - \mathrm{E}[L(R)]\bigr| > \varepsilon\Bigr) \leq \delta(N, \varepsilon, R), \qquad [4]$$
with $\lim_{N \to \infty} \delta(N, \varepsilon, R) = 0$. Then, for any particular R, we could give probably approximately correct (28) guarantees, and say that, e.g., to have 95% confidence
that the true error is within 0.001 of the empirical error requires at least 144,000
samples (or whatever the precise numbers may be). If we can give probably ap-
proximately correct (PAC) guarantees on the performance of one machine, we
can give them for any finite collection of machines. But if we have infinitely
many possible machines, might not there always be some of them which are
misbehaving? Can we still give PAC guarantees when R is continuous?
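To see why finite collections are easy (a sketch under the same assumed Hoeffding bound, not a result quoted from the text): if each of m machines individually satisfies $\Pr(|\hat{L} - \mathrm{E}[L]| > \varepsilon) \leq 2e^{-2N\varepsilon^2}$, the union bound controls all of them at once at the price of a factor of m, so the required sample size grows only logarithmically in m.

    import math

    # Assumed per-machine Hoeffding bound, loss in [0, 1]:
    # Pr(|empirical - true error| > eps) <= 2 exp(-2 N eps^2).

    def delta_single(n, eps):
        """Deviation probability for one fixed machine."""
        return 2 * math.exp(-2 * n * eps ** 2)

    def delta_collection(n, eps, m):
        """Union bound: probability that ANY of m machines deviates by > eps."""
        return min(1.0, m * delta_single(n, eps))

    def n_required(eps, alpha, m):
        """Sample size for all m machines to be within eps, confidence 1 - alpha."""
        return math.ceil(math.log(2 * m / alpha) / (2 * eps ** 2))

    for m in (1, 10, 10_000):
        print(m, n_required(eps=0.01, alpha=0.05, m=m))  # 18445, 29958, 64497

Letting m grow without bound makes the union bound vacuous, which is why the continuous case needs the capacity argument that follows.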
The answer to this question depends on how flexible the set of machines is: its capacity. We need to know how easy it is to find an R such that $f(X, R)$ will accommodate itself to any Y. This is measured by a quantity called the Vapnik-Chervonenkis (VC) dimension (22). 6 If the VC dimension d of a class of machines is finite, one can make a PAC guarantee that applies to all machines in the class simultaneously:
$$\Pr\Bigl(\max_{R}\,\bigl|\hat{L}(R) - \mathrm{E}[L(R)]\bigr| \geq \varepsilon\Bigr) \leq \eta(N, d, \varepsilon), \qquad [5]$$
where the function $\eta(N, d, \varepsilon)$ expresses the rate of convergence. It depends on the
particular kind of loss function involved. For example, for binary classification,
if the loss function is the fraction of inputs misclassified,