bound gives the expected degree of confidence in the classification of new inputs, since from ε_t and M, which are both available in practice, one can bound ε_g.
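As a minimal sketch of how such a bound is used in practice, the following Python fragment inverts a Vapnik-type bound of the form 4 exp[−(Mδ²/2 − G(2M))] (the form used below) to turn a measured training error ε_t and a sample size M into a confidence interval on ε_g. The function name, the confidence level η, and the use of the asymptotic expression for G(2M) are illustrative assumptions.

import math

def generalization_bound(eps_t, M, M_vc, eta=0.05):
    """Illustrative sketch: invert a bound of the form
    P[sup (eps_g - eps_t) > delta] <= 4 exp(-(M*delta**2/2 - G(2M)))
    to get eps_g <= eps_t + delta with confidence 1 - eta.
    G(2M) is replaced by its asymptotic form M_vc*ln(2M/M_vc),
    which is only meaningful when 2M is well above M_vc."""
    G_2M = M_vc * math.log(2 * M / M_vc)
    # Setting the right-hand side of the bound equal to eta and solving for delta:
    delta = math.sqrt(2.0 * (G_2M + math.log(4.0 / eta)) / M)
    return eps_t + delta

# Example: training error 0.10 on M = 100 000 examples, assumed VC dimension 50
print(generalization_bound(eps_t=0.10, M=100_000, M_vc=50))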
6.7.4.1 The Vapnik-Chervonenkis Dimension
The question of bounding the generalization error of a network learning M examples in N dimensions has been reduced to that of how its growth function G varies with the size M of the training set. More precisely, exp G(M) is an upper bound on the number D(M, N; L_M) of dichotomies realizable by the student network. In other words,
G(M) = ln sup_{L_M} D(M, N; L_M).
Thus, we have to count the number of dichotomies of M points that the network can perform. A dichotomy of a set L_M of M points is a separation of L_M into two subsets. For example, there are 2^M possible dichotomies of M points in input space. Those correspond to all the ways of assigning classes ±1 to the examples. If the network is able to implement all of them, then G(M) = M ln 2 ∝ M, and the bound

lim_{M→∞} P[ sup_{w,L_M} ( ε_g(w) − ε_t(w; L_M) ) > δ ] ≤ 4 exp[−(Mδ²/2 − G(2M))]

is useless. Clearly, if M is small enough, then a perceptron is able to perform all the dichotomies. For example, two patterns in
dimension N = 2 can always be separated by a perceptron. Even three exam-
ples are separable into all the possible dichotomies, provided the points are
not aligned. In a space of larger dimension N, that non-alignment condition is called the condition of general position (which is equivalent to requiring that no subset of more than N points lies on a hyperplane). Coming back to two dimensions, it is easy to verify that beyond three points, only a fraction of all the possible dichotomies is linearly separable. If all 2^M dichotomies are realizable, the network is said to learn by heart; in that case G(M) ∝ M (where ∝ means proportional to) and the bound is useless. That result can be understood intuitively: if the network can learn by heart, either it has too many parameters (in which case G(M) is too large) or M is too small. In both cases, the information conveyed by the training set is not sufficient to compute weights with good generalization properties.
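The counting argument can be made concrete with a short numerical experiment. The sketch below (an illustration, not taken from the text; the helper names and the use of scipy's linear-programming routine are assumptions) enumerates all 2^M class assignments of M random points in the plane and checks each one for linear separability, thereby computing the number D(M, N; L_M) of dichotomies a perceptron can realize.

import itertools
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    # Feasible iff there exist w, b with y_i (w . x_i + b) >= 1 for all i.
    M, N = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((M, 1))])  # rows: -y_i (x_i, 1)
    b_ub = -np.ones(M)
    res = linprog(c=np.zeros(N + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (N + 1))
    return res.status == 0  # 0 = feasible (optimal), 2 = infeasible

def count_dichotomies(X):
    # Brute-force count of linearly separable class assignments of the rows of X.
    M = X.shape[0]
    labelings = itertools.product([-1.0, 1.0], repeat=M)
    return sum(is_linearly_separable(X, np.array(y)) for y in labelings)

rng = np.random.default_rng(0)
N = 2
for M in range(2, 7):
    X = rng.standard_normal((M, N))  # points in general position (almost surely)
    D = count_dichotomies(X)
    print(f"M={M}: D={D} of 2^M={2**M} dichotomies realizable")

For M up to N + 1 = 3 every dichotomy is realizable (the perceptron learns by heart), whereas for larger M only a decreasing fraction of the 2^M dichotomies is linearly separable, as described above.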
In general, whatever the network architecture, there is a maximal number of examples M_VC, called the Vapnik-Chervonenkis (or VC) dimension, that the student can learn by heart. Beyond that number, the network can only perform a subset of all the possible dichotomies. That is, only for M > M_VC does G(M) grow more slowly than M, and

lim_{M→∞} P[ sup_{w,L_M} ( ε_g(w) − ε_t(w; L_M) ) > δ ] ≤ 4 exp[−(Mδ²/2 − G(2M))]

becomes a true bound. The behavior of G is the following:
G(M) ∝ M                     if M < M_VC,
G(M) ∝ M_VC ln(M / M_VC)     if M > M_VC.
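To see numerically why the bound is only useful once G(M) grows more slowly than M, the short sketch below (with arbitrary illustrative values δ = 0.2 and M_VC = 100) evaluates the logarithm of the right-hand side 4 exp[−(Mδ²/2 − G(2M))] in the two regimes of G; a negative logarithm means the bound has dropped below 1 and has become informative.

import math

def log_rhs(M, delta, G_2M):
    # natural log of 4 exp(-(M delta^2 / 2 - G(2M)))
    return math.log(4.0) - (M * delta**2 / 2.0 - G_2M)

delta, M_vc = 0.2, 100
print(f"{'M':>9} {'G ~ M ln 2':>14} {'G ~ M_VC ln(M/M_VC)':>21}")
for M in (1_000, 10_000, 100_000, 1_000_000):
    g_heart = 2 * M * math.log(2)          # learning by heart: G(2M) = 2M ln 2
    g_vc = M_vc * math.log(2 * M / M_vc)   # asymptotic form, valid for 2M >> M_VC
    print(f"{M:>9} {log_rhs(M, delta, g_heart):>14.1f} {log_rhs(M, delta, g_vc):>21.1f}")

With G(2M) = 2M ln 2 (learning by heart) the logarithm keeps growing with M and the bound never says anything, whereas with the slower growth M_VC ln(2M/M_VC) it eventually becomes strongly negative, so the probability of a large gap between ε_t(w; L_M) and ε_g(w) vanishes.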