After n independent folds we obtain, with probability at least $(1-2\eta)^n$:

$$\frac{1}{n}\sum_{k=1}^{n} R_{emp}(\omega_{I',k}) - \sqrt{\frac{-\ln\eta}{2I'}} \;\le\; C \;\le\; \frac{1}{n}\sum_{k=1}^{n} R_{emp}(\omega_{I',k}) + \sqrt{\frac{\ln N - \ln\eta}{2I'}}. \qquad (23)$$
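For completeness, a brief remark on where the probability $(1-2\eta)^n$ comes from (the event symbols $A_k$ are ours, introduced only for this remark): each fold $k$ satisfies its two-sided bound with probability at least $1-2\eta$, by a union bound over two one-sided inequalities that each hold with probability at least $1-\eta$. Treating the $n$ folds as independent, all $n$ bounds then hold simultaneously with probability at least

$$P\Big(\bigcap_{k=1}^{n} A_k\Big) = \prod_{k=1}^{n} P(A_k) \;\ge\; (1-2\eta)^n,$$

where $A_k$ denotes the event that the two-sided bound holds in fold $k$.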
To conclude the proof, we need to somehow relate $R_{emp}(\omega_{I',k})$ from each fold to $R_{emp}(\omega_I)$. We need the relation in the direction $R_{emp}(\omega_{I',k}) \le \cdots$, so that we can plug its right-hand side into (23) and keep the bound true. Intuitively, one might expect that choosing an optimal function on a larger sample leads to a greater empirical risk compared to a smaller sample, i.e. $R_{emp}(\omega_{I',k}) \le R_{emp}(\omega_I)$, because it is usually easier to fit fewer data points using models of equally rich complexity. But we do not know with what probability that occurs. Contrarily, one may easily find a specific data subset for which $R_{emp}(\omega_I) < R_{emp}(\omega_{I',k})$.
Lemma 1. With probability 1, the following inequality holds:

$$\sum_{i=1}^{I'} Q(z_i, \omega_{I'}) \;\le\; \sum_{i=1}^{I} Q(z_i, \omega_I). \qquad (24)$$
On the level of sums of errors, not means, the total error for a larger sample always surpasses the total error for a smaller sample. This gives us $I' R_{emp}(\omega_{I'}) \le I R_{emp}(\omega_I)$ and further, since each fold trains on $I' = \frac{n-1}{n} I$ examples:

$$R_{emp}(\omega_{I'}) \;\le\; \frac{n}{n-1}\, R_{emp}(\omega_I). \qquad (25)$$
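Before moving on, Lemma 1 and relation (25) are easy to verify numerically. The sketch below assumes a hypothetical finite set of N threshold classifiers evaluated on random data; all names and constants (N, I, n, the noise rate) are illustrative, not taken from the paper.

# Minimal numerical sanity check of Lemma 1 / relation (25); the finite
# class of N threshold classifiers and all constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)

N = 50                       # size of the set {omega_1, ..., omega_N}
I = 200                      # full sample size
n = 5                        # number of folds
I_prime = I * (n - 1) // n   # training-part size per fold, I' = (n-1)I/n

thresholds = np.linspace(0.0, 1.0, N)   # omega_j: predict 1 iff x > t_j

def total_errors(x, y):
    # Total 0/1 error of every threshold function on the sample (x, y).
    preds = x[None, :] > thresholds[:, None]
    return (preds != y[None, :]).sum(axis=1)

for trial in range(1000):
    x = rng.random(I)
    y = (x > 0.4) ^ (rng.random(I) < 0.2)          # noisy labels
    sub = rng.choice(I, size=I_prime, replace=False)

    full_min = total_errors(x, y).min()            # I  * R_emp(omega_I)
    sub_min = total_errors(x[sub], y[sub]).min()   # I' * R_emp(omega_I')

    # Lemma 1 holds deterministically: the subset-optimal total error
    # never exceeds the full-sample-optimal total error.
    assert sub_min <= full_min

print("Lemma 1 held in all trials; dividing by I' yields relation (25).")

The assertion passes deterministically, not merely with high probability: the function optimal on the subset does at least as well there as the full-sample optimum restricted to the same subset, and discarding the remaining non-negative error terms only decreases the total.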
Unfortunately, relation (25) is of no use here because of the coefficient $\frac{n}{n-1}$. Thinking of $C - V$ in the theorem, we need a relation with coefficients 1 at both $C$ and $V$. In [2, p. 124] we find the following helpful assertion:
Lemma 2. With probability at least $1 - 2\eta$:

$$\int_Z Q(z, \omega_I)\, dF(z) \;-\; \underbrace{\inf_{1 \le j \le N} \int_Z Q(z, \omega_j)\, dF(z)}_{R(\omega_0)} \;\le\; \sqrt{\frac{\ln N - \ln\eta}{2I}} + \sqrt{\frac{-\ln\eta}{2I}} \qquad (26)$$

— the true risk for the selected function $\omega_I$ is not farther from the minimal possible risk for this set of functions than $\sqrt{\frac{\ln N - \ln\eta}{2I}} + \sqrt{\frac{-\ln\eta}{2I}}$.
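To get a feel for the magnitude of the right-hand side of (26), it can be evaluated numerically. The settings of N, I, and eta below are arbitrary illustrations, not values used in the paper.

# Evaluate the right-hand side of (26) for illustrative settings of
# N (number of functions), I (sample size), and eta (confidence level).
import math

def lemma2_bound(N, I, eta):
    # sqrt((ln N - ln eta) / (2I)) + sqrt(-ln eta / (2I))
    return (math.sqrt((math.log(N) - math.log(eta)) / (2 * I))
            + math.sqrt(-math.log(eta) / (2 * I)))

for N, I, eta in [(10, 100, 0.05), (10, 1000, 0.05), (1000, 1000, 0.05)]:
    print(f"N={N:5d}  I={I:5d}  eta={eta}  bound={lemma2_bound(N, I, eta):.4f}")

As expected, the bound tightens with the sample size $I$ and grows only logarithmically with the number of functions $N$.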
The proof of that statement, given by Vapnik, is based on two inequalities, each holding with probability at least $1 - \eta$: the first is (13), which we repeat here, and the second is the Chernoff inequality for the best function $\omega_0$: