bound gives the expected degree of confidence in the classification of new inputs, since from ε_t and M, which are both available in practice, one can bound ε_g.
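As a minimal sketch of how such a bound is used in practice, the following Python fragment inverts a Vapnik-type bound of the form 4 exp[−(Mδ²/2 − G(2M))] (the form used below) to turn a measured training error ε_t and a sample size M into a confidence interval on ε_g. The function name, the confidence level η, and the use of the asymptotic expression for G(2M) are illustrative assumptions.

import math

def generalization_bound(eps_t, M, M_vc, eta=0.05):
    """Illustrative sketch: invert a bound of the form
    P[sup (eps_g - eps_t) > delta] <= 4 exp(-(M*delta**2/2 - G(2M)))
    to get eps_g <= eps_t + delta with confidence 1 - eta.
    G(2M) is replaced by its asymptotic form M_vc*ln(2M/M_vc),
    which is only meaningful when 2M is well above M_vc."""
    G_2M = M_vc * math.log(2 * M / M_vc)
    # Setting the right-hand side of the bound equal to eta and solving for delta:
    delta = math.sqrt(2.0 * (G_2M + math.log(4.0 / eta)) / M)
    return eps_t + delta

# Example: training error 0.10 on M = 100 000 examples, assumed VC dimension 50
print(generalization_bound(eps_t=0.10, M=100_000, M_vc=50))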
6.7.4.1 The Vapnik-Chervonenkis Dimension
The question of bounding the generalization error of a network learning M examples in N dimensions has been reduced to that of how its growth function G varies with the size M of the training set. More precisely, exp G(M) is an upper bound on the number D(M, N; L_M) of dichotomies realizable by the student network. In other words,
G(M) = ln sup_{L_M} D(M, N; L_M).
Thus, we have to count the number of dichotomies of M points that the network can perform. A dichotomy of a set L_M of M points is a separation of L_M into two subsets. For example, there are 2^M possible dichotomies of M points in input space. Those correspond to all the ways of assigning classes ±1 to the examples. If the network is able to implement all of them, then G(M) = M ln 2 ∝ M, and the bound

lim_{M→∞} P[ sup_{w,L_M} ( ε_g(w) − ε_t(w; L_M) ) > δ ] ≤ 4 exp[−(Mδ²/2 − G(2M))]

is useless. Clearly, if M is small enough, then a perceptron is able to perform all the dichotomies. For example, two patterns in
dimension N = 2 can always be separated by a perceptron. Even three exam-
ples are separable into all the possible dichotomies, provided the points are
not aligned. In a space of larger dimension N, that non-alignment condition is called the condition of general position (which is equivalent to requiring that no subset of more than N points lies on a hyperplane). Coming back to two dimensions, it is easy to verify that beyond three points, only a fraction of all the possible dichotomies is linearly separable. If all 2^M dichotomies are realizable, the network is said to learn by heart; in that case G(M) ∝ M (where ∝ means proportional to) and the bound is useless. That result can be understood intuitively: if the network can learn by heart, either it has too many parameters (in which case G(M) is too large) or M is too small. In both cases, the information conveyed by the training set is not sufficient to compute weights with good generalization properties.
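The counting argument can be made concrete with a short numerical experiment. The sketch below (an illustration, not taken from the text; the helper names and the use of scipy's linear-programming routine are assumptions) enumerates all 2^M class assignments of M random points in the plane and checks each one for linear separability, thereby computing the number D(M, N; L_M) of dichotomies a perceptron can realize.

import itertools
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    # Feasible iff there exist w, b with y_i (w . x_i + b) >= 1 for all i.
    M, N = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((M, 1))])  # rows: -y_i (x_i, 1)
    b_ub = -np.ones(M)
    res = linprog(c=np.zeros(N + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (N + 1))
    return res.status == 0  # 0 = feasible (optimal), 2 = infeasible

def count_dichotomies(X):
    # Brute-force count of linearly separable class assignments of the rows of X.
    M = X.shape[0]
    labelings = itertools.product([-1.0, 1.0], repeat=M)
    return sum(is_linearly_separable(X, np.array(y)) for y in labelings)

rng = np.random.default_rng(0)
N = 2
for M in range(2, 7):
    X = rng.standard_normal((M, N))  # points in general position (almost surely)
    D = count_dichotomies(X)
    print(f"M={M}: D={D} of 2^M={2**M} dichotomies realizable")

For M up to N + 1 = 3 every dichotomy is realizable (the perceptron learns by heart), whereas for larger M only a decreasing fraction of the 2^M dichotomies is linearly separable, as described above.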
In general, whatever the network architecture, there is a maximal number of examples M_VC, called the Vapnik-Chervonenkis (or VC) dimension, that the student can learn by heart. Beyond that number, the network can only perform a subset of all the possible dichotomies. That is, only for M > M_VC does G(M) grow more slowly than M, and

lim_{M→∞} P[ sup_{w,L_M} ( ε_g(w) − ε_t(w; L_M) ) > δ ] ≤ 4 exp[−(Mδ²/2 − G(2M))]

becomes a true bound. The behavior of G is the following:
G(M) ∝ M                     if M < M_VC,
G(M) ∝ M_VC ln(M / M_VC)     if M > M_VC.
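To see numerically why the bound is only useful once G(M) grows more slowly than M, the short sketch below (with arbitrary illustrative values δ = 0.2 and M_VC = 100) evaluates the logarithm of the right-hand side 4 exp[−(Mδ²/2 − G(2M))] in the two regimes of G; a negative logarithm means the bound has dropped below 1 and has become informative.

import math

def log_rhs(M, delta, G_2M):
    # natural log of 4 exp(-(M delta^2 / 2 - G(2M)))
    return math.log(4.0) - (M * delta**2 / 2.0 - G_2M)

delta, M_vc = 0.2, 100
print(f"{'M':>9} {'G ~ M ln 2':>14} {'G ~ M_VC ln(M/M_VC)':>21}")
for M in (1_000, 10_000, 100_000, 1_000_000):
    g_heart = 2 * M * math.log(2)          # learning by heart: G(2M) = 2M ln 2
    g_vc = M_vc * math.log(2 * M / M_vc)   # asymptotic form, valid for 2M >> M_VC
    print(f"{M:>9} {log_rhs(M, delta, g_heart):>14.1f} {log_rhs(M, delta, g_vc):>21.1f}")

With G(2M) = 2M ln 2 (learning by heart) the logarithm keeps growing with M and the bound never says anything, whereas with the slower growth M_VC ln(2M/M_VC) it eventually becomes strongly negative, so the probability of a large gap between ε_t(w; L_M) and ε_g(w) vanishes.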