6.7.3 The Optimal Bayesian Classifier
Within the Bayesian framework, the probability that the class of a new input pattern x is σ, given that learning was performed with the training set L_M, is

\[
P(\sigma \mid x, L_M) = \int P(\sigma \mid x, w)\, p(w \mid L_M)\, dw ,
\]

where p(w | L_M) is the posterior probability of the weights, which in turn depends on the evidence p(L_M | w) and the prior p_0(w).
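This posterior average can be approximated numerically whenever samples from p(w | L_M) are available. Below is a minimal sketch, not taken from the text; the names bayes_class_prob, posterior_samples and cond_prob are illustrative. The integral is replaced by an average of the conditional probability P(σ = +1 | x, w) over sampled weights.

```python
import numpy as np

def bayes_class_prob(x, posterior_samples, cond_prob):
    """Monte Carlo version of
        P(sigma | x, L_M) = integral of P(sigma | x, w) p(w | L_M) dw :
    average P(sigma = +1 | x, w) over weights w drawn from p(w | L_M)."""
    return float(np.mean([cond_prob(x, w) for w in posterior_samples]))

def perceptron_cond_prob(x, w):
    """Deterministic perceptron conditional: P(sigma = +1 | x, w) = Theta(x . w)."""
    return float(np.dot(x, w) > 0)
```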
Remark. If the classifier is deterministic, and if w_learn are the weights that minimize the cost function, as is the case for the classifiers considered in this chapter, then p(w | L_M) = δ(w − w_learn), and P(σ | x, w_learn) is either 0 or 1. Note that the evidence depends on the training algorithm through the weights w_learn. For a student perceptron, P(σ | x, w_learn) = Θ(σ x · w_learn). Therefore, if x · w_learn > 0, we have P(σ = +1 | x, w_learn) = 1 and P(σ = −1 | x, w_learn) = 0, and symmetrically for x · w_learn < 0. The output of a Bayesian perceptron is therefore nothing but the output of the simple perceptron with weights w_learn.

Some classifiers are not deterministic. In that case, the probability law P(σ | x, w) is different from the Heaviside function Θ assumed in this chapter. For example, if the inputs of the perceptron are subject to additive noise η, the probability that the response to pattern x is σ can be written as:
\[
P(\sigma \mid x, L_M) = P(\sigma\, x \cdot w_{\mathrm{learn}} + \delta > 0)
                      = P(\delta > -\sigma\, x \cdot w_{\mathrm{learn}})
                      = \int_{-\sigma\, x \cdot w_{\mathrm{learn}}}^{+\infty} p(\delta)\, d\delta ,
\]

where δ stands for σ η · w_learn.
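For instance, if the input noise is isotropic Gaussian, η ~ N(0, s²I) — an assumption made here only for illustration — then δ = σ η · w_learn is Gaussian with standard deviation s‖w_learn‖, and the integral above reduces to a Gaussian tail probability. A minimal sketch under that assumption (the function name and the parameter noise_std are illustrative):

```python
import numpy as np
from math import erf, sqrt

def p_sigma_given_x(x, w_learn, sigma, noise_std):
    """P(sigma | x, L_M) for a perceptron whose inputs carry additive
    Gaussian noise eta ~ N(0, noise_std**2 I).  Then delta = sigma * eta . w_learn
    is Gaussian with standard deviation noise_std * ||w_learn||, and
    P(delta > -sigma * x . w_learn) equals the standard normal CDF
    evaluated at the normalized margin."""
    margin = sigma * np.dot(x, w_learn)
    spread = noise_std * np.linalg.norm(w_learn)
    z = margin / spread
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))   # Phi(z)
```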
Another case of non-deterministic output arises when the posterior probability of the weights, p(w | L_M), is not a delta function. For example, consider training a perceptron with the error-counting cost function on a set of linearly separable examples. That cost is highly degenerate: there is a continuum of weights that learn without errors (more generally, that continuum exists whenever the task to be learnt can be performed by the student classifier). Samples of those weights may be obtained with the perceptron algorithm, since its result depends on the weight initialization and on the order of the updates. The weights that correctly classify the training patterns form a dense subset w_learn that occupies a finite volume in weight space. Thus, the posterior probability p(w | L_M) is constant in that volume and vanishes outside it. To guarantee correct normalization, p(w | L_M) = 1/|w_learn| inside the volume, where |w_learn| denotes the volume of that subset. After replacing in P(σ | x, L_M), we have
\[
P(\sigma \mid x, L_M) = \int_{w \in w_{\mathrm{learn}}} \Theta(\sigma\, x \cdot w)\, \frac{1}{|w_{\mathrm{learn}}|}\, dw .
\]
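A crude numerical counterpart of this vote, assuming a small linearly separable training set: run the perceptron algorithm repeatedly with random initial weights and shuffled update orders, keep the error-free solutions, and average Θ(σ x · w) over them. The functions perceptron and bayesian_vote below are illustrative; note that perceptron solutions are not uniformly distributed over the version space, so this only roughly approximates the uniform posterior written above.

```python
import numpy as np

rng = np.random.default_rng(0)

def perceptron(X, y, w0, max_epochs=1000):
    """Classical perceptron rule; the solution it reaches depends on the
    initial weights w0 and on the (shuffled) order of the updates."""
    w = w0.copy()
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(len(y)):
            if y[i] * (X[i] @ w) <= 0:     # pattern misclassified
                w += y[i] * X[i]           # perceptron update
                mistakes += 1
        if mistakes == 0:                  # zero training error: w lies in w_learn
            return w
    raise RuntimeError("training set may not be linearly separable")

def bayesian_vote(X, y, x_new, n_samples=200):
    """Monte Carlo estimate of P(sigma = +1 | x_new, L_M): average
    Theta(x_new . w) over weights sampled by independent perceptron runs."""
    ws = [perceptron(X, y, rng.normal(size=X.shape[1])) for _ in range(n_samples)]
    return float(np.mean([(x_new @ w) > 0 for w in ws]))
```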