Since the training set is a random variable, the general conditions that
guarantee that the limit condition is verified for any L_M are not trivial.
Vapnik showed that the condition is satisfied if and only if the probability
of the largest difference between both terms vanishes uniformly:

\lim_{M\to\infty} P\Big\{ \sup_{w,\,L_M} \big[ \varepsilon_g(w) - \varepsilon_t(w; L_M) \big] > \delta \Big\} = 0 .
The meaning of the above equation is the following: suppose that we have
all the possible training sets L_M of M examples, drawn at random with the
unknown probability

p(L_M) = \prod_{k=1}^{M} p(x_k, y_k) = \prod_{k=1}^{M} p(x_k)\, P(y_k \mid x_k) .

The argument in brackets in the above convergence condition means that we
determine, for each L_M, weights such that the difference between ε_t (the
fraction of misclassified patterns) and ε_g (the generalization error) is
maximal. Loosely speaking, the probability P represents the fraction of
training sets for which this difference is larger than δ. This means that P
is the probability of the worst case, since it is the fraction of training
sets that exhibit a training error very different from the generalization
error. Now, in order to guarantee the good quality of learning, we have to
be sure that ε_t and ε_g are close to each other in all cases. That is why
the worst case is considered. If the condition of uniform convergence is
verified, then ε_t is a good estimator of ε_g for all training algorithms
and training sets L_M. That condition guarantees that weights that minimize
ε_t, but endow the network with a very poor generalization performance, do
not exist (in probability). Since the convergence condition is an asymptotic
law, the "guarantee" will be valid only for sufficiently large M.
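The role of the worst case over training sets can be illustrated numerically. The sketch below is an illustration, not taken from the text: it uses a hypothetical one-dimensional threshold classifier, draws many random training sets, and, among the thresholds that minimize the training error ε_t, deliberately picks the one with the worst generalization error ε_g (mimicking the sup over w and L_M). It then estimates the fraction of training sets for which ε_g − ε_t exceeds δ, which shrinks as M grows:

```python
import random

random.seed(0)

TRUE_THETA = 0.3  # hypothetical target rule: y = 1 iff x > 0.3

def sample_set(m):
    # training set L_M: M examples drawn i.i.d. from p(x) P(y | x)
    xs = [random.random() for _ in range(m)]
    return [(x, 1 if x > TRUE_THETA else 0) for x in xs]

def train_error(t, data):
    # epsilon_t: fraction of misclassified patterns for threshold t
    return sum((1 if x > t else 0) != y for x, y in data) / len(data)

def gen_error(t):
    # epsilon_g: for this noiseless 1-D task it is exactly |t - theta|
    return abs(t - TRUE_THETA)

def fit_worst_minimizer(data, grid):
    # among thresholds minimizing epsilon_t, take the one with the
    # largest epsilon_g (the worst case considered by the theory)
    errs = [(train_error(t, data), t) for t in grid]
    best = min(e for e, _ in errs)
    t = max((t for e, t in errs if e == best), key=gen_error)
    return t, best

def prob_bad(m, delta=0.1, trials=500):
    # estimate P{ sup [epsilon_g - epsilon_t] > delta } over training sets
    grid = [i / 100 for i in range(101)]
    bad = 0
    for _ in range(trials):
        data = sample_set(m)
        t, et = fit_worst_minimizer(data, grid)
        if gen_error(t) - et > delta:
            bad += 1
    return bad / trials

p_small, p_large = prob_bad(5), prob_bad(100)
print(p_small, p_large)  # the fraction of "bad" training sets drops with M
```

Even the worst error minimizer generalizes well once M is large, which is the content of the uniform convergence condition.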
More precisely, Vapnik proved the following inequality: for any δ,

P\Big\{ \sup_{w,\,L_M} \big[ \varepsilon_g(w) - \varepsilon_t(w; L_M) \big] > \delta \Big\} \le 4 \exp\!\big( -(M\delta^2 - G(M)) \big) ,
where G(M), called the growth function, is an upper bound of the logarithm
of D(M,N), the number of dichotomies (separations into two subsets) that
the student network can perform in dimension N, given a training set of M
points. We will see below how that number has been computed in the case
of the perceptron. G(2M) is an increasing function of its argument, which is
independent of the particular training set to be learnt; it depends only on the
architecture of the student network: the number of hidden units, of weights,
etc. Note that the second term in the above relation is a useful bound (≤ 1)
only for G(M)/M < δ². Thus, the above condition is meaningful only if G
increases slower than linearly with M.
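For the perceptron, the dichotomy count D(M,N) discussed above is classically given by Cover's counting formula, D(M,N) = 2 Σ_{k=0}^{N−1} C(M−1, k), for M points in general position; this formula is background knowledge, not derived in this passage. A short sketch using it shows the behavior the text describes: D grows as 2^M up to M ≈ N and only polynomially afterwards, so G(M)/M eventually falls below δ², which is exactly when the right-hand side of Vapnik's inequality drops below 1 and becomes informative:

```python
import math

def dichotomies(m, n):
    # Cover's counting formula: number of linearly separable dichotomies
    # of m points in general position in dimension n
    # (math.comb(m - 1, k) is 0 for k > m - 1, so D = 2**m when m <= n)
    return 2 * sum(math.comb(m - 1, k) for k in range(n))

def growth(m, n):
    # growth function G(M): here taken as the exact logarithm of D(M, N)
    return math.log(dichotomies(m, n))

def vapnik_bound(m, n, delta):
    # right-hand side 4 exp(-(M delta^2 - G(M))) of the inequality;
    # it is a useful bound only when it is below 1
    return 4 * math.exp(-(m * delta ** 2 - growth(m, n)))

# for N = 10 and delta = 0.2: the bound is vacuous for small M
# and collapses toward 0 once G(M)/M < delta**2
for m in (10, 100, 10_000):
    print(m, growth(m, 10) / m, vapnik_bound(m, 10, 0.2))
```

The loop makes the linearity threshold concrete: at M = 100 the bound still exceeds 1, while at M = 10 000 it is astronomically small.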
To summarize, the condition of uniform convergence that guarantees good
generalization after training from a set of examples has been transformed into
the problem of determining the student network growth function G(2M) as a
function of the size M of the training set. In particular, the bound proves that
the generalization error can be bounded only if G increases with M slower
than a linear function. If the growth function of the network is known, the