The a posteriori probability density function for the student weights is

$$
p(w \mid L_M) = \frac{P(L_M \mid w)\, p_0(w)}{P_B(L_M)},
$$

where

$$
P_B(L_M) = \int dw\, P(L_M \mid w)\, p_0(w)
$$

guarantees the correct normalization of $p(w \mid L_M)$. It is the marginal probability of the examples within the type of students corresponding to the prior $p_0$. Depending on the hypotheses implicit in the choices of the prior $p_0(w)$ and of the evidence $P(L_M \mid w)$, different Bayesian inferences will be obtained.
Remark. The expression of the a posteriori probability density function for the student weights is Bayes formula applied to the classifier parameters, considered as random variables that depend on the realizations of the training set. Note that, in Chap. 1, Bayes formula was applied to the pattern classes, considered as random variables depending on the realizations of the vector of descriptors $x$. Those are two different applications of Bayes formula, both within the field of pattern classification.
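To illustrate how this formula operates, the following sketch computes the posterior numerically for a toy problem. The choices made in it, a single scalar weight, a standard Gaussian prior, a logistic evidence, and a synthetic training set, are illustrative assumptions, not the specific models developed in this chapter.

```python
import numpy as np

# Numerical illustration of Bayes formula applied to the classifier
# parameters: p(w | L_M) = P(L_M | w) p_0(w) / P_B(L_M).
# ASSUMPTIONS (illustration only): a single scalar weight w, a standard
# Gaussian prior p_0, independent examples, and a logistic evidence
# P(y | x, w) = 1 / (1 + exp(-y * w * x)). The input density p(x_k) does
# not depend on w, so it cancels with P_B(L_M) and is omitted here.

rng = np.random.default_rng(0)

# A small synthetic training set L_M with classes y_k in {-1, +1}
x = rng.normal(size=20)
y = np.sign(x + 0.3 * rng.normal(size=20))

w_grid = np.linspace(-5.0, 5.0, 1001)                    # discretized weight space
dw = w_grid[1] - w_grid[0]
prior = np.exp(-0.5 * w_grid**2) / np.sqrt(2.0 * np.pi)  # Gaussian prior p_0(w)

def evidence(w):
    """P(L_M | w): product over the examples of the assumed logistic model."""
    return np.prod(1.0 / (1.0 + np.exp(-y * w * x)))

likelihood = np.array([evidence(w) for w in w_grid])

P_B = np.sum(likelihood * prior) * dw     # marginal probability P_B(L_M)
posterior = likelihood * prior / P_B      # normalized p(w | L_M)

print("P_B(L_M) =", P_B)
print("posterior mean of w:", np.sum(w_grid * posterior) * dw)
```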
Usual priors for perceptrons are the Gaussian prior,

$$
p_0(w) = \frac{1}{(2\pi)^{N/2}} \exp\left(-\frac{\|w\|^2}{2}\right),
$$
and the uniform prior on the surface of a hypersphere of radius equal to the
norm of the weight vector. For example,
$$
p_0(w) = \delta\bigl(\|w\|^2 - 1\bigr)
$$
imposes a unitary norm. In the case of a student perceptron that performs linear discriminations, the above relation is an appropriate choice, since we have already seen that only the orientation of $w$ must be learnt. Note that the above priors do not introduce any information. In the case of the Gaussian prior, it amounts to assuming that any weight vector has a non-vanishing probability, with a preference for weights of small norm. With the uniform prior, all orientations have the same probability. Any additional information about the problem should be included in the prior, through an educated choice of $p_0(w)$.
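As an illustration of these two priors, the sketch below draws weight vectors from each of them; the dimension N, the number of samples, and the use of normalized Gaussian samples to obtain uniform orientations are choices made here only for the sake of the example.

```python
import numpy as np

# Sampling from the two usual perceptron priors discussed above.
# The dimension N and the number of samples are illustrative choices.

rng = np.random.default_rng(0)
N = 50            # assumed input dimension
n_samples = 1000

# Gaussian prior: p_0(w) = (2*pi)^(-N/2) * exp(-||w||^2 / 2)
w_gauss = rng.normal(size=(n_samples, N))

# Uniform prior on the unit hypersphere: p_0(w) = delta(||w||^2 - 1).
# Normalizing isotropic Gaussian samples yields uniformly distributed
# orientations with unit norm.
w_sphere = w_gauss / np.linalg.norm(w_gauss, axis=1, keepdims=True)

print("mean squared norm under the Gaussian prior:",
      np.mean(np.sum(w_gauss**2, axis=1)))      # close to N
print("norms under the spherical prior:",
      np.linalg.norm(w_sphere, axis=1)[:3])     # all equal to 1
```

Under the spherical prior all weight vectors have the same norm, so only their orientation varies, which is consistent with the fact that only the orientation of $w$ must be learnt.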
The other term of the a posteriori probability density function for the student weights that must be provided is the evidence. It contains the information about the examples of the learning problem. If the examples are independent, one can write
$$
P(L_M \mid w) = \prod_{k=1}^{M} P(y_k \mid x_k, w)\, p(x_k),
$$
where $p(x_k)$ is the probability density of the input vectors. $P(y_k \mid x_k, w)$, the evidence for example $k$, is the probability that a network with weights $w$ assigns the correct class $y_k$ to the input $x_k$ belonging to $L_M$.
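The sketch below evaluates this factorized evidence for a perceptron. The "output noise" model $P(y_k \mid x_k, w) = 1-\varepsilon$ if $\operatorname{sign}(w \cdot x_k) = y_k$ and $\varepsilon$ otherwise, the teacher perceptron used to label the data, and the dimensions are assumptions introduced only for this example; the factor $p(x_k)$ is dropped because it does not depend on $w$.

```python
import numpy as np

# Evaluating P(L_M | w) = prod_k P(y_k | x_k, w) p(x_k) for a perceptron.
# ASSUMED evidence ("output noise" model, for illustration only):
#   P(y_k | x_k, w) = 1 - eps  if sign(w . x_k) == y_k,  else eps.
# The input density p(x_k) does not depend on w and is dropped, since it
# plays no role when weight vectors are compared through the posterior.

rng = np.random.default_rng(1)
N, M = 20, 100     # assumed input dimension and number of examples
eps = 0.1          # assumed output-noise level

# Synthetic training set L_M labeled by a "teacher" perceptron
w_teacher = rng.normal(size=N)
x = rng.normal(size=(M, N))
y = np.sign(x @ w_teacher)

def log_evidence(w, x, y, eps):
    """log prod_k P(y_k | x_k, w) under the assumed output-noise model."""
    correct = np.sign(x @ w) == y
    return np.sum(np.where(correct, np.log(1.0 - eps), np.log(eps)))

w_student = rng.normal(size=N)       # a randomly drawn student
print("log P(L_M | w_teacher):", log_evidence(w_teacher, x, y, eps))
print("log P(L_M | w_student):", log_evidence(w_student, x, y, eps))
```

Weight vectors that classify the examples correctly have a much larger evidence, so the posterior $p(w \mid L_M)$ concentrates on them as $M$ grows.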
 