where the integral is over the volume. The class assigned by the optimal Bayesian classifier to an input is such that the posterior probability of the class, P(σ | x, L_M), is maximal. In the particular case of the perceptron, all the weights in the volume are equiprobable, and the Bayesian prescription is equivalent to a vote: it assigns to each new input x the class σ that maximizes P(σ | x, L_M). If P(+1 | x, L_M) > P(−1 | x, L_M), the optimal Bayesian decision is that the class of x is σ = +1; otherwise it is σ = −1. That result was mentioned in Chap. 1 under the term “Bayes decision rule”. Notice that it takes this simple form only because p(w | L_M) is a constant.
Remark. In the particular case of a simple perceptron that learns a linearly separable task, the optimal Bayesian decision is to classify the new input in the same way as the majority of the possible error-free solutions w_learn(L_M). In other words, the Bayesian decision in this case is the winner-takes-all solution.
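As a concrete illustration (not part of the original text), the following Python sketch approximates this Bayesian vote by rejection sampling: weight vectors are drawn uniformly on the unit sphere, only those that classify the training set L_M without error are kept, and the new input x is assigned the class chosen by the majority of these error-free perceptrons. The function name bayesian_vote and all other identifiers are hypothetical.

import numpy as np

def bayesian_vote(X_train, y_train, x_new, n_samples=100000, seed=0):
    """Monte Carlo approximation of the Bayesian (majority-vote) decision
    for a perceptron whose weights are uniformly distributed over the
    volume of error-free solutions. All names are illustrative."""
    rng = np.random.default_rng(seed)
    # Draw candidate weight vectors uniformly on the unit sphere.
    w = rng.standard_normal((n_samples, X_train.shape[1]))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    # Keep only the weights that reproduce all training labels
    # (the error-free solutions w_learn(L_M) of a separable task).
    error_free = np.all(np.sign(w @ X_train.T) == y_train, axis=1)
    w_ok = w[error_free]
    if w_ok.shape[0] == 0:
        raise RuntimeError("no error-free solution sampled; increase n_samples")
    # Each error-free perceptron votes on the class of the new input;
    # the Bayesian decision is the class receiving the majority of votes.
    votes = np.sign(w_ok @ x_new)
    return 1 if votes.sum() > 0 else -1

On a linearly separable training set this reproduces the majority-of-solutions rule of the remark above; rejection sampling is, of course, practical only in low dimension and with few examples.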
6.7.4 Vapnik's Statistical Learning Theory
In this section we present some results of Vapnik's statistical learning theory, without proofs (see [Vapnik 1995]). The main question addressed by this theory is the following: are there conditions that guarantee that minimizing the number of training errors provides the classifier with a minimal generalization error, even though the distribution p(x, y) is unknown? In other words, under those conditions, the weights that minimize the training error ε_t should also minimize the generalization error ε_g, irrespective of the probability distribution of the patterns and of the particular realization of the training set. Clearly, a necessary condition is that the minimum of the training error, or empirical risk (the quantity actually minimized by the learning process), tend to the generalization error, or functional risk (the quantity we would like to minimize), as the number of training examples increases. More specifically, the following limit must be obeyed:
$$\lim_{M \to \infty} \varepsilon_t(\mathbf{w}_{\mathrm{learn}}; L_M) = \inf_{\mathbf{w}} \varepsilon_g(\mathbf{w}),$$
where w_learn denotes the classifier weights that minimize the number of training errors. If the above relation is satisfied, the training error is a good estimator of the generalization error: minimizing the former is a good strategy for minimizing the latter. Notice that when the student architecture is well adapted to the task, the right-hand side vanishes. This is indeed the case for a perceptron learning a linearly separable task: we have mentioned previously that, in such a case, there are infinitely many weights that make ε_t vanish; in fact, there is a finite volume of solutions in weight space. The condition is then satisfied by any algorithm able to find a linear separation. However, in general the student architecture is not perfectly adapted to the learning problem. Then inf_w ε_g(w) ≠ 0, and it is difficult to guarantee that an algorithm will find weights that satisfy the limit, especially if the cost function presents local minima.
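As an aside (this numerical experiment is not in the original text; its teacher/student setup, function names and parameters are assumptions made for illustration), the following Python sketch shows the behaviour described by the limit above for a student architecture that is well adapted to the task: a perceptron is trained on M examples labelled by a fixed teacher perceptron, so that inf_w ε_g(w) = 0, and its training error and an estimate of its generalization error are printed for increasing M.

import numpy as np

def perceptron_train(X, y, n_epochs=200):
    # Plain perceptron rule; reaches zero training error when the
    # data are linearly separable (given enough epochs).
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w += y_i * x_i
    return w

def error_rate(w, X, y):
    return np.mean(np.sign(X @ w) != y)

rng = np.random.default_rng(0)
n_features = 20
w_teacher = rng.standard_normal(n_features)        # defines a separable task
X_test = rng.standard_normal((20000, n_features))  # large set to estimate eps_g
y_test = np.sign(X_test @ w_teacher)

for M in (20, 100, 500, 2000):
    X = rng.standard_normal((M, n_features))
    y = np.sign(X @ w_teacher)
    w_learn = perceptron_train(X, y)
    # eps_t stays (near) zero for every M; the estimate of eps_g
    # decreases towards inf_w eps_g = 0 as M grows.
    print(M, error_rate(w_learn, X, y), error_rate(w_learn, X_test, y_test))

With this well-adapted student the training error remains at zero while the generalization estimate shrinks as M grows; with a mismatched architecture the training error would instead rise towards a nonzero inf_w ε_g(w), which is the general situation discussed above.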