Information Technology Reference
In-Depth Information
Probabilistic Connection between Cross-Validation
and Vapnik Bounds
Przemysław Klesk
Department of Methods of Artificial Intelligence and Applied Mathematics
Westpomeranian University of Technology
ul. Zolnierska 49, Szczecin, Poland
pklesk@wi.zut.edu.pl
Abstract.
In the paper we analyze a connection between outcomes of the cross-
validation procedure and Vapnik bounds [1,2] on generalization of learning
machines. We do
not
focus on how well the measured cross-validation outcome
estimates the generalization error or how far it is from the training error; instead,
we want to make statements about the cross-validation result without actually
measuring it. In particular we want to state probabilistically what
-difference one
can expect between the
known
Vapnik bound and the
unknown
cross-validation
result for given conditions of the experiment. In the consequence, we are able to
calculate the necessary size of the training sample, so that the ε is sufficiently
small; and so that the optimal complexity indicated via SRM is acceptable in
the sense that cross-validation, if performed, would probably indicate the same
complexity. We consider a non-stratified variant of cross-validation, which is con-
venient for the main theorem.
ε
Keywords.
Statistical learning theory, Bounds on generalization, Cross valida-
tion, Empirical risk minimization, Structural risk minimization, Vapnik-Chervon-
enkis dimension.
1
Introduction and Notation
One part of the
Statistical Learning Theory
developed by Vapnik [1,2,3] is the
theory
of bounds
. It provides probabilistic bounds on generalization of learning machines. The
key mathematical tools applied to derive the bounds in their additive versions are Cher-
noff and Hoeffding inequalities
1
[2,4,5,6].
We use this theory to show a probabilistic relationship between two approaches for
complexity selection:
n-fold cross-validation
(popular among practioner modelers) and
Structural Risk Minimization
proposed by Vapnik (rarely met in practice) [7,8,9,10]. We
remind that SRM is
O
(
n
)
times faster than
n
-fold cross-validation (since SRM does not
perform any repetitions/folds per single fixed complexity, nor testing) but less accurate,
Chernoff inequality:
P
|
ν
I
−
p
|≥
ε
≤
2exp
(
−
2ε
2
I
)
, Hoeffding inequality:
P
|
X
I
−
EX
|≥
1
ε
≤
2exp
(
−
2
I
2ε
A
2
)
. Meaning (respectively): observed frequencies on a sample of size
I
converge to the true probability as
I
grows large; analogically: means of a random variable
(bounded by
A
and
B
) converge to the expected value. It is a
in-probability-convergence
and
its rate is exponential.
B
2
−