Remark 1. All choices made before training, such as the network architecture (multilayered or not), the activation function (binary or real-valued), or the feature space of the SVMs, correspond to different priors, and are included implicitly in $p_0(\mathbf{w})$.
Remark 2. Note that if the evidence is multiplicative, which is a consequence
of the assumed independence of the patterns, then the expectation of any
additive function of the examples is the sum of the expectations. This remark,
developed in the next paragraph, justifies the use of cost functions that are
sums of partial costs per example.
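For instance, since the evidence of an independent training set factorizes over the examples, its negative logarithm is additive (the set notation $D_M$ below is introduced here only for illustration):
$$
P(D_M \mid \mathbf{w}) \;=\; \prod_{k=1}^{M} P(y^k \mid \mathbf{x}^k, \mathbf{w})
\qquad\Longrightarrow\qquad
-\ln P(D_M \mid \mathbf{w}) \;=\; \sum_{k=1}^{M} \bigl(-\ln P(y^k \mid \mathbf{x}^k, \mathbf{w})\bigr),
$$
so that cost functions of the form $C(\mathbf{w}) = (1/M)\sum_{k} V(z^k)$ arise naturally as sums of partial costs.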
6.7.2 A Probabilistic Interpretation of the Perceptron Cost Functions
Within the probabilistic framework, considering a linear student perceptron
corresponds to the implicit assumption that the discrimination problem is
linearly separable. If we also assume that the task is deterministic, then the
evidence of an example k is
$$
P(y^k \mid \mathbf{x}^k, \mathbf{w}) = \Theta(z^k),
$$
where $z^k = y^k\, \mathbf{x}^k \cdot \mathbf{w}$ is the aligned field. The expected misclassification error of a student with weights $\mathbf{w}$ on example $k$ is therefore
$$
\varepsilon_t^k = 0 \cdot \Theta(z^k) + 1 \cdot \Theta(-z^k) = \Theta(-z^k).
$$
Therefore, the expected number of training errors is
$$
E = \sum_{k=1}^{M} \Theta(-z^k),
$$
which is equal (up to an irrelevant constant factor $1/M$) to the cost function $C(\mathbf{w}) = (1/M)\sum_{k=1}^{M} V(z^k)$, with the partial cost given by $V(z) = \Theta(-z)$.
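The following minimal Python sketch (an illustration added here, not part of the original text) evaluates the aligned fields $z^k$ and this error-counting cost for a candidate weight vector; the arrays X, y and the vector w are hypothetical placeholders.

```python
import numpy as np

def error_counting_cost(w, X, y):
    """C(w) = (1/M) * sum_k Theta(-z^k): the fraction of misclassified
    training examples, with aligned fields z^k = y^k * (x^k . w)."""
    z = y * (X @ w)          # aligned fields z^k, one per example
    return np.mean(z <= 0)   # Theta(-z^k) = 1 for errors (z^k <= 0 counted here)

# Hypothetical usage: M = 4 examples in N = 2 dimensions
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5], [0.5, -2.0]])
y = np.array([1, -1, -1, 1])   # class labels y^k = +/- 1
w = np.array([1.0, 1.0])       # candidate weight vector
print(error_counting_cost(w, X, y))  # 0.5: two of the four examples are misclassified
```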
Remark. The previous relation shows that, if the classification is deterministic, the weights that minimize $C(\mathbf{w})$ with partial cost $\Theta(-z)$ are those that minimize the expected number of classification errors.
If we assume that the inputs are perturbed by an additive noise, we have $\tilde{\mathbf{x}}^k = \mathbf{x}^k + \boldsymbol{\eta}^k$, where the components of the vector $\boldsymbol{\eta}^k \in \mathbb{R}^N$ are random variables of zero mean, satisfying $|\eta_i| \ll |x_i|$. The stability of an example $k$ is thus $\tilde{\gamma}^k = \gamma^k + \delta^k$, with $\gamma^k = y^k\, \mathbf{x}^k \cdot \mathbf{w}/\|\mathbf{w}\|$. Then $\delta^k = y^k\, \boldsymbol{\eta}^k \cdot \mathbf{w}/\|\mathbf{w}\|$ is a random variable with zero mean and probability density function $p(\delta^k)$. The probability of misclassification of an example $k$ belonging to the training set is
$$
P(\gamma^k + \delta^k < 0) \;=\; P(\delta^k < -\gamma^k) \;=\; \int_{-\infty}^{-\gamma^k} p(\delta^k)\, \mathrm{d}\delta^k .
$$
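As a hedged illustration (added here; the original text does not specify the noise distribution), if $\delta^k$ is assumed Gaussian with zero mean and standard deviation $\sigma$, this integral is simply the standard normal cumulative distribution function evaluated at $-\gamma^k/\sigma$:

```python
import math

def misclassification_probability(gamma, sigma):
    """P(delta < -gamma) for zero-mean Gaussian noise of standard deviation
    sigma, i.e. the standard normal CDF evaluated at -gamma/sigma."""
    return 0.5 * math.erfc(gamma / (sigma * math.sqrt(2.0)))

# Hypothetical values: a well-classified example (gamma > 0) is flipped only
# by large noise, while a negative stability is flipped more often than not.
print(misclassification_probability(1.0, 0.5))   # ~0.023
print(misclassification_probability(-0.5, 0.5))  # ~0.84
```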