where the integral is over the volume. The class assigned by the optimal Bayesian classifier to an input is such that the posterior probability of the class, P(σ | x, L_M), is maximal. In the particular case of the perceptron, all the weights in the volume are equiprobable, and the Bayesian prescription is equivalent to a vote: it assigns to each new input x the class σ that maximizes P(σ | x, L_M). If P(+1 | x, L_M) > P(−1 | x, L_M), the optimal Bayesian decision is that the class of x is σ = +1; otherwise it is σ = −1. That result was mentioned in Chap. 1 under the term “Bayes decision rule”. Notice that it takes this simple form only because p(w | L_M) is a constant.
Remark. In the particular case of a simple perceptron that learns a linearly separable task, the optimal Bayesian decision is to classify the new input in the same way as the majority of the possible error-free solutions w_learn(L_M). In other words, the Bayesian decision in this case is the winner-takes-all solution.
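As a concrete illustration (not part of the original text), the following Python sketch approximates this Bayesian vote by rejection sampling: weight vectors are drawn uniformly on the unit sphere, only those that classify the training set L_M without error are kept, and the new input x is assigned the class chosen by the majority of these error-free perceptrons. The function name bayesian_vote and all other identifiers are hypothetical.

import numpy as np

def bayesian_vote(X_train, y_train, x_new, n_samples=100000, seed=0):
    """Monte Carlo approximation of the Bayesian (majority-vote) decision
    for a perceptron whose weights are uniformly distributed over the
    volume of error-free solutions. All names are illustrative."""
    rng = np.random.default_rng(seed)
    # Draw candidate weight vectors uniformly on the unit sphere.
    w = rng.standard_normal((n_samples, X_train.shape[1]))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    # Keep only the weights that reproduce all training labels
    # (the error-free solutions w_learn(L_M) of a separable task).
    error_free = np.all(np.sign(w @ X_train.T) == y_train, axis=1)
    w_ok = w[error_free]
    if w_ok.shape[0] == 0:
        raise RuntimeError("no error-free solution sampled; increase n_samples")
    # Each error-free perceptron votes on the class of the new input;
    # the Bayesian decision is the class receiving the majority of votes.
    votes = np.sign(w_ok @ x_new)
    return 1 if votes.sum() > 0 else -1

On a linearly separable training set this reproduces the majority-of-solutions rule of the remark above; rejection sampling is, of course, practical only in low dimension and with few examples.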
6.7.4 Vapnik's Statistical Learning Theory
In this section we present some results of Vapnik's statistical learning theory, without proofs (see [Vapnik 1995]). The main question addressed by this theory is the following: are there conditions that guarantee that minimizing the number of training errors provides the classifier with a minimal generalization error, even though the distribution p(x, y) is unknown? In other words, under those conditions, the weights that minimize the training error ε_t should also minimize the generalization error ε_g, irrespective of the probability distribution of the patterns and of the particular realization of the training set. Clearly, a necessary condition is that the minimum of the training error, or empirical risk (the quantity actually minimized by the learning process), tend to the generalization error, or functional risk (the quantity we would like to minimize), as the number of training examples increases. More specifically, the following limit must be obeyed:
$$\lim_{M \to \infty} \varepsilon_t(\mathbf{w}_{\mathrm{learn}}; L_M) = \inf_{\mathbf{w}} \varepsilon_g(\mathbf{w}),$$
where w_learn denotes the classifier weights that minimize the number of training errors. If the above relation is satisfied, the training error is a good estimator of the generalization error: minimizing the former is a good strategy for minimizing the latter. Notice that when the student architecture is well adapted to the task, the right-hand side vanishes. This is indeed the case for a perceptron learning a linearly separable task: we have mentioned previously that, in such a case, there are infinitely many weights that make ε_t vanish; in fact, there is a finite volume of solutions in weight space. The condition is then satisfied by any algorithm able to find a linear separation. However, in general the student architecture is not perfectly adapted to the learning problem. Then inf_w ε_g(w) ≠ 0, and it is difficult to guarantee that an algorithm will find weights that satisfy the limit, especially if the cost function presents local minima.
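As an aside (this numerical experiment is not in the original text; its teacher/student setup, function names and parameters are assumptions made for illustration), the following Python sketch shows the behaviour described by the limit above for a student architecture that is well adapted to the task: a perceptron is trained on M examples labelled by a fixed teacher perceptron, so that inf_w ε_g(w) = 0, and its training error and an estimate of its generalization error are printed for increasing M.

import numpy as np

def perceptron_train(X, y, n_epochs=200):
    # Plain perceptron rule; reaches zero training error when the
    # data are linearly separable (given enough epochs).
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w += y_i * x_i
    return w

def error_rate(w, X, y):
    return np.mean(np.sign(X @ w) != y)

rng = np.random.default_rng(0)
n_features = 20
w_teacher = rng.standard_normal(n_features)        # defines a separable task
X_test = rng.standard_normal((20000, n_features))  # large set to estimate eps_g
y_test = np.sign(X_test @ w_teacher)

for M in (20, 100, 500, 2000):
    X = rng.standard_normal((M, n_features))
    y = np.sign(X @ w_teacher)
    w_learn = perceptron_train(X, y)
    # eps_t stays (near) zero for every M; the estimate of eps_g
    # decreases towards inf_w eps_g = 0 as M grows.
    print(M, error_rate(w_learn, X, y), error_rate(w_learn, X_test, y_test))

With this well-adapted student the training error remains at zero while the generalization estimate shrinks as M grows; with a mismatched architecture the training error would instead rise towards a nonzero inf_w ε_g(w), which is the general situation discussed above.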