6.7 Theoretical Questions
6.7.1 The Probabilistic Framework
Learning from examples makes sense only if there is some regularity in the data. Within the statistical formulation of training, it is generally assumed that the patterns are {input-output} pairs drawn independently at random from an unknown probability distribution p(x, y). In particular, the probability of the learning set L_M is

p(L_M) = \prod_{k=1}^{M} p(x_k, y_k) = \prod_{k=1}^{M} p(x_k) P(y_k | x_k).
The second term above corresponds to the following process: first the input x_k is drawn at random with probability density p(x_k); then, given x_k, the class y_k is selected with conditional probability P(y_k | x_k). The case of deterministic classes considered in this chapter is just a particular case of this formulation.
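To make the generative description above concrete, here is a minimal sketch in Python; the logistic form chosen for P(y | x) and all names (w_true, p_y_given_x) are illustrative assumptions, not part of the text. It draws M independent (x_k, y_k) pairs and evaluates the factorized log-probability of the resulting training set.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 100, 5          # number of examples, input dimension

def p_y_given_x(y, x, w_true):
    """Illustrative conditional P(y | x): a logistic model with y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-y * (w_true @ x)))

w_true = rng.standard_normal(n)   # hypothetical parameters of the data source

# Draw the training set L_M: x_k ~ p(x) (here standard Gaussian), then y_k ~ P(y | x_k)
X = rng.standard_normal((M, n))
prob_plus = np.array([p_y_given_x(+1, x, w_true) for x in X])
Y = np.where(rng.random(M) < prob_plus, 1, -1)

# log p(L_M) = sum_k [ log p(x_k) + log P(y_k | x_k) ]  (logarithm of the product formula)
log_p_x = np.sum(-0.5 * X**2 - 0.5 * np.log(2 * np.pi), axis=1)   # Gaussian density of x_k
log_p_y = np.log([p_y_given_x(y, x, w_true) for x, y in zip(X, Y)])
log_p_LM = np.sum(log_p_x + log_p_y)
print(log_p_LM)
```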
Remark. The "teacher-student" paradigm, suggested in Chap. 2 for testing regression models, is frequently used in classification theory. It is usually assumed that the components of the input patterns are either Gaussian variables,

p(x_i) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x_i^2}{2}\right),

or uniformly distributed variables within some interval [-a, a]: p(x_i) = 1/(2a). Then, the classes of the input vectors x_k are defined by a "teacher" network of weights w. For example, if the teacher is a deterministic perceptron, one has P(y_k | x_k) = \Theta(y_k \, w \cdot x_k). The aim of learning is to find weights w that convey good generalization properties to the "student". Besides the examples of L_M, the "student" is expected to classify correctly any pattern drawn at random with the same probability p(x) as the training set.
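A minimal sketch of this teacher-student setting, assuming standard-Gaussian input components and a deterministic perceptron teacher (function names such as generalization are illustrative): the teacher assigns y_k = sign(w_teacher · x_k), so P(y_k | x_k) = Θ(y_k w_teacher · x_k) equals 1 for the assigned class and 0 otherwise, and the student is scored by its agreement with the teacher on fresh patterns drawn from the same p(x).

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 200, 10

# Teacher perceptron: its weights define the "true" classification rule
w_teacher = rng.standard_normal(n)

# Inputs drawn i.i.d. from p(x): standard Gaussian components
X = rng.standard_normal((M, n))

# Deterministic classes: y_k = sign(w_teacher . x_k), i.e. P(y_k | x_k) = Theta(y_k w_teacher . x_k)
Y = np.sign(X @ w_teacher)

def generalization(w_student, n_test=10_000):
    """Probability that the student agrees with the teacher on a fresh pattern from p(x)."""
    X_test = rng.standard_normal((n_test, n))
    return np.mean(np.sign(X_test @ w_student) == np.sign(X_test @ w_teacher))

print(generalization(w_teacher))               # 1.0: the teacher generalizes perfectly
print(generalization(rng.standard_normal(n)))  # ~0.5 on average for random student weights
```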
Because the training set L_M is probabilistic, the student weights w depend on the particular realization of L_M. Therefore, w is a random variable. In this section we apply the method of Bayesian inference to the determination of the probability distribution p(w | L_M). This method is based on Bayes' theorem, introduced in Chap. 1, which can formally be written as follows:

p(w | L_M) P_B(L_M) = P(L_M | w) p_0(w),

where P_B(L_M) is defined below; p_0(w) is the a priori probability of the classifier parameters (the weights in the case of neural networks) before learning, and P(L_M | w), called the evidence, is the probability of the training set L_M when the student has weights w. The a posteriori probability density function for the student weights is therefore

p(w | L_M) = \frac{P(L_M | w) p_0(w)}{P_B(L_M)}.
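Continuing the perceptron teacher-student example, the sketch below (assuming an isotropic Gaussian prior p_0(w) and the deterministic-teacher evidence P(L_M | w) = \prod_k \Theta(y_k w \cdot x_k); the random candidate scheme is only illustrative) evaluates the unnormalized posterior P(L_M | w) p_0(w) for candidate student weights.

```python
import numpy as np

rng = np.random.default_rng(2)
M, n = 50, 3
w_teacher = rng.standard_normal(n)
X = rng.standard_normal((M, n))
Y = np.sign(X @ w_teacher)            # training set L_M labelled by the teacher

def log_prior(w):
    """A priori density p_0(w): isotropic Gaussian over student weights (an assumption)."""
    return -0.5 * np.dot(w, w) - 0.5 * n * np.log(2 * np.pi)

def log_evidence(w):
    """log P(L_M | w) for a deterministic perceptron student: 0 if every example is
    classified correctly (all Theta factors equal 1), -inf otherwise."""
    return 0.0 if np.all(Y * (X @ w) > 0) else -np.inf

def log_unnormalized_posterior(w):
    # Bayes theorem: p(w | L_M) P_B(L_M) = P(L_M | w) p_0(w)
    return log_evidence(w) + log_prior(w)

# Crude illustration: score a few random candidate weight vectors against the data
for _ in range(5):
    print(log_unnormalized_posterior(rng.standard_normal(n)))
print(log_unnormalized_posterior(w_teacher))   # the teacher itself always has nonzero posterior
```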