Since the training set is a random variable, the general conditions that
guarantee that the limit condition is verified for any L_M are not trivial.
Vapnik showed that the condition is satisfied if and only if the probability
of the largest difference between both terms vanishes uniformly:

\lim_{M\to\infty} P\Big\{ \sup_{w,\,L_M} \big[ \varepsilon_g(w) - \varepsilon_t(w; L_M) \big] > \delta \Big\} = 0 .
The meaning of the above equation is the following: suppose that we have
all the possible training sets L_M of M examples, drawn at random with the
unknown probability

p(L_M) = \prod_{k=1}^{M} p(x_k, y_k) = \prod_{k=1}^{M} p(x_k)\, P(y_k \mid x_k) .

The argument in brackets in the above convergence condition means that we
determine, for each L_M, weights such that the difference between ε_t (the
fraction of misclassified patterns) and ε_g (the generalization error) is
maximal. Loosely speaking, the probability P represents the fraction of
training sets for which this difference is larger than δ. This means that P
is the probability of the worst case, since it is the fraction of training
sets that exhibit a training error very different from the generalization
error. Now, in order to guarantee the good quality of learning, we have to
be sure that ε_t and ε_g are close to each other in all cases. That is why
the worst case is considered. If the condition of uniform convergence is
verified, then ε_t is a good estimator of ε_g for all training algorithms
and training sets L_M. That condition guarantees that weights that minimize
ε_t, but endow the network with a very poor generalization performance, do
not exist (in probability). Since the convergence condition is an asymptotic
law, the "guarantee" will be valid only for sufficiently large M.
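The role of the worst case over training sets can be illustrated numerically. The sketch below is an illustration, not taken from the text: it uses a hypothetical one-dimensional threshold classifier, draws many random training sets, and, among the thresholds that minimize the training error ε_t, deliberately picks the one with the worst generalization error ε_g (mimicking the sup over w and L_M). It then estimates the fraction of training sets for which ε_g − ε_t exceeds δ, which shrinks as M grows:

```python
import random

random.seed(0)

TRUE_THETA = 0.3  # hypothetical target rule: y = 1 iff x > 0.3

def sample_set(m):
    # training set L_M: M examples drawn i.i.d. from p(x) P(y | x)
    xs = [random.random() for _ in range(m)]
    return [(x, 1 if x > TRUE_THETA else 0) for x in xs]

def train_error(t, data):
    # epsilon_t: fraction of misclassified patterns for threshold t
    return sum((1 if x > t else 0) != y for x, y in data) / len(data)

def gen_error(t):
    # epsilon_g: for this noiseless 1-D task it is exactly |t - theta|
    return abs(t - TRUE_THETA)

def fit_worst_minimizer(data, grid):
    # among thresholds minimizing epsilon_t, take the one with the
    # largest epsilon_g (the worst case considered by the theory)
    errs = [(train_error(t, data), t) for t in grid]
    best = min(e for e, _ in errs)
    t = max((t for e, t in errs if e == best), key=gen_error)
    return t, best

def prob_bad(m, delta=0.1, trials=500):
    # estimate P{ sup [epsilon_g - epsilon_t] > delta } over training sets
    grid = [i / 100 for i in range(101)]
    bad = 0
    for _ in range(trials):
        data = sample_set(m)
        t, et = fit_worst_minimizer(data, grid)
        if gen_error(t) - et > delta:
            bad += 1
    return bad / trials

p_small, p_large = prob_bad(5), prob_bad(100)
print(p_small, p_large)  # the fraction of "bad" training sets drops with M
```

Even the worst error minimizer generalizes well once M is large, which is the content of the uniform convergence condition.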
More precisely, Vapnik proved the following inequality: for any δ,

P\Big\{ \sup_{w,\,L_M} \big[ \varepsilon_g(w) - \varepsilon_t(w; L_M) \big] > \delta \Big\} \le 4 \exp\!\big( -(M\delta^2 - G(M)) \big) ,
where G(M), called the growth function, is an upper bound of the logarithm
of D(M,N), the number of dichotomies (separations into two subsets) that
the student network can perform in dimension N, given a training set of M
points. We will see below how that number has been computed in the case
of the perceptron. G(2M) is an increasing function of its argument, which is
independent of the particular training set to be learnt; it depends only on the
architecture of the student network: the number of hidden units, of weights,
etc. Note that the second term in the above relation is a useful bound (≤ 1)
only for G(M)/M < δ². Thus, the above condition is meaningful only if G
increases slower than linearly with M.
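For the perceptron, the dichotomy count D(M,N) discussed above is classically given by Cover's counting formula, D(M,N) = 2 Σ_{k=0}^{N−1} C(M−1, k), for M points in general position; this formula is background knowledge, not derived in this passage. A short sketch using it shows the behavior the text describes: D grows as 2^M up to M ≈ N and only polynomially afterwards, so G(M)/M eventually falls below δ², which is exactly when the right-hand side of Vapnik's inequality drops below 1 and becomes informative:

```python
import math

def dichotomies(m, n):
    # Cover's counting formula: number of linearly separable dichotomies
    # of m points in general position in dimension n
    # (math.comb(m - 1, k) is 0 for k > m - 1, so D = 2**m when m <= n)
    return 2 * sum(math.comb(m - 1, k) for k in range(n))

def growth(m, n):
    # growth function G(M): here taken as the exact logarithm of D(M, N)
    return math.log(dichotomies(m, n))

def vapnik_bound(m, n, delta):
    # right-hand side 4 exp(-(M delta^2 - G(M))) of the inequality;
    # it is a useful bound only when it is below 1
    return 4 * math.exp(-(m * delta ** 2 - growth(m, n)))

# for N = 10 and delta = 0.2: the bound is vacuous for small M
# and collapses toward 0 once G(M)/M < delta**2
for m in (10, 100, 10_000):
    print(m, growth(m, 10) / m, vapnik_bound(m, 10, 0.2))
```

The loop makes the linearity threshold concrete: at M = 100 the bound still exceeds 1, while at M = 10 000 it is astronomically small.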
To summarize, the condition of uniform convergence that guarantees good
generalization after training from a set of examples has been transformed into
the problem of determining the student network growth function G(2M) as a
function of the size M of the training set. In particular, the bound proves that
the generalization error can be bounded only if G increases with M slower
than a linear function. If the growth function of the network is known, the