Since the training set is a random variable, the general conditions that
guarantee that the limit condition is verified for any L_M are not trivial. Vapnik
showed that the condition is satisfied if and only if the probability of the largest
difference between the two error terms vanishes uniformly:

  lim_{M→∞} P[ sup_{w, L_M} |ε_g(w) − ε_t(w; L_M)| > δ ] = 0  for any δ > 0.
The meaning of the above equation is the following: suppose that we have
all the possible training sets L_M of M examples, drawn at random with the
unknown probability p(L_M) = ∏_{k=1}^{M} p(x_k, y_k) = ∏_{k=1}^{M} p(x_k) P(y_k | x_k). The
argument in brackets in the above convergence condition means that we deter-
mine, for each L M , weights such that the difference between ε t (the fraction of
misclassified patterns) and ε g (the generalization error) is maximal. Loosely
speaking, the probability P represents the fraction of training sets for which
this difference is larger than δ . This means that P is the probability of the
worst case , since it is the fraction of training sets that exhibit a training error
very different from the generalization error. Now, in order to guarantee the
good quality of learning, we have to be sure that ε t and ε g are close to each
other in all cases. That is why the worst case is considered. If the condition of
uniform convergence is verified, then ε t is a good estimator of ε g for all train-
ing algorithms and training sets L M . That condition guarantees that weights
that minimize ε t , but that endow the network with a very poor generalization
performance do not exist (in probability). Since the convergence condition is
an asymptotic law, the “guarantee” will be valid only for sufficiently large M.
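The worst-case reading of P can be illustrated numerically. The sketch below is a toy illustration, not the perceptron setting of the text: it uses a hypothetical class of 1-D threshold classifiers, for which ε_g is known exactly, and estimates the worst gap |ε_g − ε_t| over thresholds and over many random training sets for increasing M.

```python
import random

def worst_case_gap(M, trials=100, theta_star=0.3, seed=0):
    """Crude Monte Carlo estimate of the worst-case gap
    sup over thresholds theta and over sampled training sets L_M of
    |eps_g(theta) - eps_t(theta; L_M)|, for the toy class of
    1-D threshold classifiers h_theta(x) = [x > theta] on x ~ U(0, 1)."""
    rng = random.Random(seed)
    grid = [i / 50 for i in range(51)]          # candidate thresholds in [0, 1]
    worst = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(M)]   # one random training set L_M
        ys = [x > theta_star for x in xs]       # noiseless teacher labels
        for theta in grid:
            # training error: fraction of the M examples misclassified by theta
            eps_t = sum(((x > theta) != y) for x, y in zip(xs, ys)) / M
            # generalization error, exact for uniform inputs: |theta - theta_star|
            eps_g = abs(theta - theta_star)
            worst = max(worst, abs(eps_g - eps_t))
    return worst

# the worst-case gap shrinks as M grows, as uniform convergence predicts
gaps = {M: worst_case_gap(M) for M in (10, 100, 500)}
```

Because this toy class has a single parameter (a growth function that increases only logarithmically), the estimated worst-case gap decreases steadily with M; a class rich enough to realize every labeling of every sample would show no such decrease.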
More precisely, Vapnik proved the following inequality: for any δ > 0,

  P[ sup_{w, L_M} |ε_g(w) − ε_t(w; L_M)| > δ ] < 4 exp[ (G(2M)/M − δ²) M ],
where G(M), called the growth function, is an upper bound of the logarithm
of D(M, N), the number of dichotomies (separations into two subsets) that
the student network can perform in dimension N given a training set of M
points. We will see below how that number has been computed in the case
of the perceptron. G(2M) is an increasing function of its argument, which is
independent of the particular training set to be learnt; it depends only on the
architecture of the student network: the number of hidden units, of weights,
etc. Note that the second term in the above relation is a useful bound (≪ 1)
only for G(2M)/M < δ². Thus, the above condition is meaningful only if G
increases more slowly than linearly with M.
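For the perceptron announced above, the number of dichotomies has a classical closed form (Cover's counting function). The sketch below uses it to evaluate the growth function as G(M) = ln D(M, N) and the resulting right-hand side of the inequality, assuming the bound takes the form 4 exp[(G(2M)/M − δ²) M]; the exact constants in the exponent vary between statements of the theorem.

```python
from math import comb, exp, log

def dichotomies(M, N):
    """Cover's count of the dichotomies of M points in general position
    that a perceptron through the origin can realize in dimension N:
    D(M, N) = 2 * sum_{k=0}^{N-1} C(M-1, k)  (equals 2**M when M <= N,
    since math.comb returns 0 for k > M-1)."""
    return 2 * sum(comb(M - 1, k) for k in range(N))

def growth(M, N):
    # G(M) = ln D(M, N): logarithm of the number of dichotomies
    return log(dichotomies(M, N))

def vapnik_bound(M, N, delta):
    # assumed form of the right-hand side: 4 * exp[(G(2M)/M - delta**2) * M]
    return 4.0 * exp(growth(2 * M, N) - delta * delta * M)
```

For fixed N and M ≫ N, G(M) grows like N ln M, i.e. more slowly than linearly, so for any fixed δ the bound eventually drops below 1 and becomes informative; for small M it is vacuous.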
To summarize, the condition of uniform convergence that guarantees good
generalization after training from a set of examples has been transformed into
the problem of determining the student network growth function G(2M) as a
function of the size M of the training set. In particular, the bound proves that
the generalization error can be bounded only if G increases with M more slowly
than a linear function. If the growth function of the network is known, the