I(N, d, ε) = √( [ d (ln(2N/d) + 1) + ln(4/ε) ] / N )          [6]
Notice that R is not an argument to I, and does not appear in [6]. The rate of
convergence is the same across all machines; this kind of result is thus called a
uniform law of large numbers. The really remarkable thing about [5] is that it
holds no matter what the sampling distribution is, so long as samples are
independent; it is a distribution-free result.
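To get a feel for how fast this guarantee tightens, here is a small Python sketch (a minimal illustration assuming the reconstruction of [6] above; the function name vc_bound and the particular values of N, d, and ε are ours, not the chapter's) that evaluates the bound for a few sample sizes and VC dimensions.

```python
from math import log, sqrt

def vc_bound(n_samples, vc_dim, eps):
    """Deviation bound I(N, d, eps) of [6] (as reconstructed above):
    sqrt( (d*(ln(2N/d) + 1) + ln(4/eps)) / N ),
    where eps is the allowed failure probability."""
    return sqrt((vc_dim * (log(2 * n_samples / vc_dim) + 1)
                 + log(4 / eps)) / n_samples)

# Whatever the sampling distribution, the guarantee shrinks roughly
# like sqrt(d * ln(N) / N) as the sample grows.
for d in (5, 50):
    for n in (10**3, 10**4, 10**5):
        print(f"d={d:3d}  N={n:7d}  I={vc_bound(n, d, 0.05):.3f}")
```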
The VC bounds lead to a very nice learning scheme: simply apply empirical
risk minimization, for a fixed class of machines, and then give a PAC guarantee
that the one picked is, with high reliability, very close to the actual optimal
machine. The VC bounds also lead to an appealing penalization scheme, where the
penalty is equal to our bound on the over-fitting, I. Specifically, we set the term
M_d(R) in [1] equal to the I in [5], ensuring, with high probability, that the F and
M_d(R) terms in [2] cancel each other. This is structural risk minimization
(SRM).
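As a concrete illustration of the SRM selection rule, the Python sketch below assumes we already know, for each candidate model class, its empirical risk on the data and (an upper bound on) its VC dimension; the helper names, the penalty formula (the reconstructed [6]), and the toy numbers are ours, not the chapter's.

```python
from math import log, sqrt

def vc_penalty(n_samples, vc_dim, eps):
    """Capacity penalty: the over-fitting bound I from [5]/[6]
    (same formula as vc_bound in the previous sketch)."""
    return sqrt((vc_dim * (log(2 * n_samples / vc_dim) + 1)
                 + log(4 / eps)) / n_samples)

def structural_risk_minimization(classes, n_samples, eps=0.05):
    """classes: list of (name, empirical_risk, vc_dim) for the candidate
    model classes.  Pick the class minimizing empirical risk + penalty."""
    scored = [(emp_risk + vc_penalty(n_samples, d, eps), name)
              for name, emp_risk, d in classes]
    return min(scored)

# Hypothetical numbers for, say, polynomial classifiers of increasing degree:
candidates = [("degree 1", 0.21, 3),
              ("degree 3", 0.12, 10),
              ("degree 9", 0.05, 55)]
bound, best = structural_risk_minimization(candidates, n_samples=2000)
print(best, "has the smallest guaranteed risk:", round(bound, 3))
```

With these made-up numbers the middle class wins: the simplest class under-fits, while the richest class's low empirical risk is more than eaten up by its capacity penalty.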
It's important to realize that the VC dimension is not the same as the number
of parameters. For some classes of functions, it is much lower than the number
of parameters, and for others it's much higher. (There are examples of one-parameter
classes of functions with infinite VC dimension; the classic one is the family of
thresholded sine waves sign(sin(ωx)), with ω as the parameter.) Determining the VC
dimension often involves subtle combinatorial arguments, but many results are
now available in the literature, and more are appearing all the time. There are
even schemes for experimentally estimating the VC dimension (29).
Two caveats are in order. First, because the VC bounds are distribution-free,
they are really about the rate of convergence under the worst possible
distribution, the one a malicious adversary out to foil our data mining would
choose. This means that in practice, convergence is often much faster than [5]
would indicate. Second, the usual proofs of the VC bounds all assume independent,
identically distributed samples, though the relationship between X and Y can
involve arbitrarily complicated dependencies.7 Recently, there has been much
progress in proving uniform laws of large numbers for dependent sequences of
samples, and structural risk minimization has been extended to what are called
"mixing" processes (30), in effect including an extra term in the I function
appearing in [5] that discounts the number of observations by their degree of
mutual dependence.
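The precise correction is worked out in (30); purely to illustrate the idea of discounting dependent observations, the sketch below uses a common rule of thumb for autocorrelated series, an "effective" sample size. This is our simplification for illustration only, not the construction in (30).

```python
def effective_sample_size(n_samples, rho1):
    """Rule-of-thumb discount for dependent data: an AR(1)-like series with
    lag-one autocorrelation rho1 carries about as much information as
    N*(1 - rho1)/(1 + rho1) independent draws.  (Illustration only,
    not the construction in reference (30).)"""
    return n_samples * (1 - rho1) / (1 + rho1)

# 10,000 strongly dependent observations (rho1 = 0.9) count for only ~526;
# that reduced N is what would then be plugged into the bound of [6].
print(effective_sample_size(10_000, 0.9))
```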
2.2. Choice of Architecture
The basic idea of data mining is to fit a model to data with minimal
assumptions about what the correct model should be, or how the variables in the
data are related. (This differs from such classical statistical questions as testing