After $n$ independent folds we obtain, with probability at least $(1 - 2\eta)^n$:

$$C \;\le\; \frac{1}{n}\sum_{k=1}^{n} R_{\mathrm{emp}}(\omega_{I'}, k) \;+\; \sqrt{\frac{\ln N - \ln\eta}{2I'}} \;+\; \sqrt{\frac{-\ln\eta}{2\frac{I}{n}}}\,. \qquad (23)$$
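As a numeric aside (the values are ours, chosen for illustration): each fold contributes two inequalities that each fail with probability at most $\eta$, so a single fold's bound holds with probability at least $1 - 2\eta$, and by the independence of the folds all $n$ bounds hold simultaneously with probability at least $(1 - 2\eta)^n$. For $\eta = 0.05$ and $n = 10$,

$$(1 - 2\eta)^n = 0.9^{10} \approx 0.349,$$

so the confidence degrades quickly with the number of folds unless $\eta$ is chosen small.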
To conclude the proof, we need to somehow relate $R_{\mathrm{emp}}(\omega_{I'}, k)$ from each fold to $R_{\mathrm{emp}}(\omega_I)$. We need the relation in the direction $R_{\mathrm{emp}}(\omega_{I'}, k) \le \cdots$, so that we can plug its right-hand side into (23) and keep (23) true. Intuitively, one might expect that choosing an optimal function on a larger sample leads to a greater empirical risk compared to a smaller sample, i.e. $R_{\mathrm{emp}}(\omega_{I'}, k) \le R_{\mathrm{emp}}(\omega_I)$, because it is usually easier to fit fewer data points using models of equally rich complexity. But we do not know with what probability that occurs. On the contrary, one may easily find a specific data subset for which $R_{\mathrm{emp}}(\omega_{I'}, k) > R_{\mathrm{emp}}(\omega_I)$.
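A minimal numerical sketch of such a subset (our own example, with an assumed squared-error loss $Q(z,\omega) = (y - \omega)^2$ and only $N = 2$ candidate constants; none of these choices come from the text):

```python
# Sketch: a specific subset on which the smaller-sample minimizer
# fits its own sample worse than the full-sample minimizer fits the
# full sample, i.e. R_emp(w_I') > R_emp(w_I).

def emp_risk(ys, w):
    """Empirical risk: mean squared error of the constant model w on ys."""
    return sum((y - w) ** 2 for y in ys) / len(ys)

def erm(ys, candidates=(0.0, 1.0)):
    """Empirical risk minimization over a finite set of N = 2 candidates."""
    return min(candidates, key=lambda w: emp_risk(ys, w))

full = [0.0, 0.0, 0.0, 0.0, 1.0]   # whole sample, I = 5
subset = [0.0, 1.0]                # a specific smaller sample, I' = 2

w_full, w_sub = erm(full), erm(subset)
print(emp_risk(subset, w_sub), ">", emp_risk(full, w_full))  # 0.5 > 0.2
```

Here the smaller sample happens to consist of the hardest points, so its minimizer incurs a larger empirical risk; this is why the lemma below compares sums of errors rather than means.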
Lemma 1. With probability 1, the following inequality holds:

$$\sum_{i=1}^{I'} Q(z_i, \omega_{I'}) \;\le\; \sum_{i=1}^{I} Q(z_i, \omega_I). \qquad (24)$$
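A one-line justification (our sketch, assuming $Q \ge 0$, that the smaller sample is contained in the larger one, and that $\omega_{I'}$ and $\omega_I$ minimize the empirical risk on their respective samples):

$$\sum_{i=1}^{I'} Q(z_i, \omega_{I'}) \;\le\; \sum_{i=1}^{I'} Q(z_i, \omega_I) \;\le\; \sum_{i=1}^{I} Q(z_i, \omega_I),$$

where the first step uses the optimality of $\omega_{I'}$ on the smaller sample and the second the non-negativity of $Q$.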
On the level of sums of errors, not means, the total error for the larger sample always surpasses the total error for the smaller sample. This gives us $I' R_{\mathrm{emp}}(\omega_{I'}) \le I\,R_{\mathrm{emp}}(\omega_I)$ and further, since each training part has size $I' = \frac{n-1}{n} I$:

$$R_{\mathrm{emp}}(\omega_{I'}) \;\le\; \frac{n}{n-1}\, R_{\mathrm{emp}}(\omega_I). \qquad (25)$$
Unfortunately, it is of no use because of the coefficient $\frac{n}{n-1}$. Thinking of $C - V$ in the theorem, we need a relation with coefficients 1 at both $C$ and $V$.
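To see the obstacle concretely (our arithmetic, not the text's), rewrite the coefficient as an excess term:

$$\frac{n}{n-1}\, R_{\mathrm{emp}}(\omega_I) \;=\; R_{\mathrm{emp}}(\omega_I) + \frac{1}{n-1}\, R_{\mathrm{emp}}(\omega_I),$$

so plugging (25) into (23) leaves a surplus of $\frac{1}{n-1} R_{\mathrm{emp}}(\omega_I)$; for example, with $n = 5$ folds it equals $\frac{1}{4} R_{\mathrm{emp}}(\omega_I)$, and for fixed $n$ it does not in general vanish as the sample size $I$ grows.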
In [2, p. 124] we find the following helpful assertion:
Lemma 2. With probability at least $1 - 2\eta$:

$$\int_Z Q(z, \omega_I)\, dF(z) \;-\; \underbrace{\inf_{1 \le j \le N} \int_Z Q(z, \omega_j)\, dF(z)}_{R(\omega_0)} \;\le\; \sqrt{\frac{\ln N - \ln\eta}{2I}} + \sqrt{\frac{-\ln\eta}{2I}}\,. \qquad (26)$$
That is, the true risk for the selected function $\omega_I$ is not farther from the minimal possible risk for this set of functions than $\sqrt{\frac{\ln N - \ln\eta}{2I}} + \sqrt{\frac{-\ln\eta}{2I}}$.
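For a feel of the magnitudes (illustrative values of our own choosing, not from the text), the right-hand side of (26) is straightforward to evaluate:

```python
import math

def vapnik_excess_bound(N, I, eta):
    """Right-hand side of (26): sqrt((ln N - ln eta)/(2I)) + sqrt((-ln eta)/(2I))."""
    return (math.sqrt((math.log(N) - math.log(eta)) / (2 * I))
            + math.sqrt(-math.log(eta) / (2 * I)))

# Assumed values: N = 100 functions, I = 1000 samples, eta = 0.05.
# The selected function's true risk then exceeds the minimal one
# by at most about 0.10.
print(round(vapnik_excess_bound(N=100, I=1000, eta=0.05), 3))  # ~0.1
```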
The proof of that statement given by Vapnik is based on two inequalities (each holding with probability at least $1 - \eta$); the first is (13), which we repeat here, and the second is the Chernoff inequality for the best function $\omega_0$: