6.7.3 The Optimal Bayesian Classifier
Within the Bayesian framework, the probability that the class of a new input pattern x is σ, given that learning was performed with the training set L_M, is

\[
P(\sigma \mid x, L_M) = \int P(\sigma \mid x, w)\, p(w \mid L_M)\, dw ,
\]

where p(w | L_M) is the posterior probability of the weights, which in turn depends on the evidence p(L_M | w) and the prior p_0(w).
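This posterior average can be approximated numerically whenever samples from p(w | L_M) are available. Below is a minimal sketch, not taken from the text; the names bayes_class_prob, posterior_samples and cond_prob are illustrative. The integral is replaced by an average of the conditional probability P(σ = +1 | x, w) over sampled weights.

```python
import numpy as np

def bayes_class_prob(x, posterior_samples, cond_prob):
    """Monte Carlo version of
        P(sigma | x, L_M) = integral of P(sigma | x, w) p(w | L_M) dw :
    average P(sigma = +1 | x, w) over weights w drawn from p(w | L_M)."""
    return float(np.mean([cond_prob(x, w) for w in posterior_samples]))

def perceptron_cond_prob(x, w):
    """Deterministic perceptron conditional: P(sigma = +1 | x, w) = Theta(x . w)."""
    return float(np.dot(x, w) > 0)
```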
Remark. If the classifier is deterministic, and if w_learn are the weights that minimize the cost function, as is the case for the classifiers considered in this chapter, then p(w | L_M) = δ(w − w_learn), and P(σ | x, w_learn) is either 0 or 1. Note that the evidence depends on the training algorithm through the weights w_learn. For a student perceptron, P(σ | x, w_learn) = Θ(σ x · w_learn). Therefore, if x · w_learn > 0, we have P(σ = +1 | x, w_learn) = 1 and P(σ = −1 | x, w_learn) = 0, and symmetrically for x · w_learn < 0. The output of a Bayesian perceptron is therefore nothing but the output of the simple perceptron with weights w_learn.

Some classifiers are not deterministic. In that case, the probability law P(σ | x, w) is different from the Heaviside function Θ assumed in this chapter. For example, if the inputs of the perceptron are subject to additive noise η, the probability that the response to pattern x is σ can be written as:
\[
P(\sigma \mid x, L_M) = P(\sigma\, x \cdot w_{\mathrm{learn}} + \delta > 0)
                      = P(\delta > -\sigma\, x \cdot w_{\mathrm{learn}})
                      = \int_{-\sigma\, x \cdot w_{\mathrm{learn}}}^{+\infty} p(\delta)\, d\delta ,
\]

where δ stands for σ η · w_learn.
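For instance, if the input noise is isotropic Gaussian, η ~ N(0, s²I) — an assumption made here only for illustration — then δ = σ η · w_learn is Gaussian with standard deviation s‖w_learn‖, and the integral above reduces to a Gaussian tail probability. A minimal sketch under that assumption (the function name and the parameter noise_std are illustrative):

```python
import numpy as np
from math import erf, sqrt

def p_sigma_given_x(x, w_learn, sigma, noise_std):
    """P(sigma | x, L_M) for a perceptron whose inputs carry additive
    Gaussian noise eta ~ N(0, noise_std**2 I).  Then delta = sigma * eta . w_learn
    is Gaussian with standard deviation noise_std * ||w_learn||, and
    P(delta > -sigma * x . w_learn) equals the standard normal CDF
    evaluated at the normalized margin."""
    margin = sigma * np.dot(x, w_learn)
    spread = noise_std * np.linalg.norm(w_learn)
    z = margin / spread
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))   # Phi(z)
```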
Another case of non-deterministic output arises when the posterior probability of the weights, p(w | L_M), is not a delta function. For example, consider training a perceptron with the error-counting cost function on a set of linearly separable examples. That cost is highly degenerate: there is a continuum of weights that learn without errors (more generally, that continuum exists whenever the task to be learnt can be performed by the student classifier). Samples of those weights may be obtained with the perceptron algorithm, since its result depends on the weight initialization and on the order of the updates. The weights that correctly classify the training patterns form a dense subset w_learn that occupies a finite volume in weight space. Thus, the posterior probability p(w | L_M) is constant in that volume and vanishes outside it. To guarantee correct normalization, p(w | L_M) = 1/|w_learn| inside the volume, where |w_learn| denotes the volume of that subset. After replacing in P(σ | x, L_M), we have
\[
P(\sigma \mid x, L_M) = \int_{w \in w_{\mathrm{learn}}} \Theta(\sigma\, x \cdot w)\, \frac{1}{|w_{\mathrm{learn}}|}\, dw .
\]
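A crude numerical counterpart of this vote, assuming a small linearly separable training set: run the perceptron algorithm repeatedly with random initial weights and shuffled update orders, keep the error-free solutions, and average Θ(σ x · w) over them. The functions perceptron and bayesian_vote below are illustrative; note that perceptron solutions are not uniformly distributed over the version space, so this only roughly approximates the uniform posterior written above.

```python
import numpy as np

rng = np.random.default_rng(0)

def perceptron(X, y, w0, max_epochs=1000):
    """Classical perceptron rule; the solution it reaches depends on the
    initial weights w0 and on the (shuffled) order of the updates."""
    w = w0.copy()
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(len(y)):
            if y[i] * (X[i] @ w) <= 0:     # pattern misclassified
                w += y[i] * X[i]           # perceptron update
                mistakes += 1
        if mistakes == 0:                  # zero training error: w lies in w_learn
            return w
    raise RuntimeError("training set may not be linearly separable")

def bayesian_vote(X, y, x_new, n_samples=200):
    """Monte Carlo estimate of P(sigma = +1 | x_new, L_M): average
    Theta(x_new . w) over weights sampled by independent perceptron runs."""
    ws = [perceptron(X, y, rng.normal(size=X.shape[1])) for _ in range(n_samples)]
    return float(np.mean([(x_new @ w) > 0 for w in ws]))
```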