Remark 1. All choices made before training, such as the network architecture (multilayered or not), the activation function (binary or real-valued), or the feature space of the SVMs, correspond to different priors, and are included implicitly in $p_0(\mathbf{w})$.
Remark 2. Note that if the evidence is multiplicative, which is a consequence
of the assumed independence of the patterns, then the expectation of any
additive function of the examples is the sum of the expectations. This remark,
developed in the next paragraph, justifies the use of cost functions that are
sums of partial costs per example.
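For instance, since the evidence of an independent training set factorizes over the examples, its negative logarithm is additive (the set notation $D_M$ below is introduced here only for illustration):
$$
P(D_M \mid \mathbf{w}) \;=\; \prod_{k=1}^{M} P(y^k \mid \mathbf{x}^k, \mathbf{w})
\qquad\Longrightarrow\qquad
-\ln P(D_M \mid \mathbf{w}) \;=\; \sum_{k=1}^{M} \bigl(-\ln P(y^k \mid \mathbf{x}^k, \mathbf{w})\bigr),
$$
so that cost functions of the form $C(\mathbf{w}) = (1/M)\sum_{k} V(z^k)$ arise naturally as sums of partial costs.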
6.7.2 A Probabilistic Interpretation of the Perceptron Cost Functions
Within the probabilistic framework, considering a linear student perceptron
corresponds to the implicit assumption that the discrimination problem is
linearly separable. If we also assume that the task is deterministic, then the
evidence of an example k is
$$
P(y^k \mid \mathbf{x}^k, \mathbf{w}) = \Theta(z^k),
$$
where $z^k = y^k\, \mathbf{x}^k \cdot \mathbf{w}$ is the aligned field. The expected misclassification error of a student with weights $\mathbf{w}$ on example $k$ is therefore
$$
\varepsilon_t^k = 0 \cdot \Theta(z^k) + 1 \cdot \Theta(-z^k) = \Theta(-z^k).
$$
Therefore, the expected number of training errors is
$$
E = \sum_{k=1}^{M} \Theta(-z^k),
$$
which is equal (up to an irrelevant constant factor $1/M$) to the cost function $C(\mathbf{w}) = (1/M)\sum_{k=1}^{M} V(z^k)$, with the partial cost given by $V(z) = \Theta(-z)$.
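The following minimal Python sketch (an illustration added here, not part of the original text) evaluates the aligned fields $z^k$ and this error-counting cost for a candidate weight vector; the arrays X, y and the vector w are hypothetical placeholders.

```python
import numpy as np

def error_counting_cost(w, X, y):
    """C(w) = (1/M) * sum_k Theta(-z^k): the fraction of misclassified
    training examples, with aligned fields z^k = y^k * (x^k . w)."""
    z = y * (X @ w)          # aligned fields z^k, one per example
    return np.mean(z <= 0)   # Theta(-z^k) = 1 for errors (z^k <= 0 counted here)

# Hypothetical usage: M = 4 examples in N = 2 dimensions
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5], [0.5, -2.0]])
y = np.array([1, -1, -1, 1])   # class labels y^k = +/- 1
w = np.array([1.0, 1.0])       # candidate weight vector
print(error_counting_cost(w, X, y))  # 0.5: two of the four examples are misclassified
```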
Remark. The previous relation shows that, if the classification is deterministic, the weights that minimize $C(\mathbf{w})$ with partial cost $\Theta(-z)$ are those that minimize the expected number of classification errors.
If we assume that the inputs are perturbed by an additive noise, we have $\tilde{\mathbf{x}}^k = \mathbf{x}^k + \boldsymbol{\eta}^k$, where the components of the vector $\boldsymbol{\eta}^k \in \mathbb{R}^N$ are random variables of zero mean, satisfying $|\eta_i| \ll |x_i|$. The stability of an example $k$ is thus $\tilde{\gamma}^k = \gamma^k + \delta^k$, with $\gamma^k = y^k\, \mathbf{x}^k \cdot \mathbf{w}/\|\mathbf{w}\|$. Then $\delta^k = y^k\, \boldsymbol{\eta}^k \cdot \mathbf{w}/\|\mathbf{w}\|$ is a random variable with zero mean and probability density function $p(\delta^k)$. The probability of misclassification of an example $k$ belonging to the training set is
$$
P(\gamma^k + \delta^k < 0) \;=\; P(\delta^k < -\gamma^k) \;=\; \int_{-\infty}^{-\gamma^k} p(\delta^k)\, \mathrm{d}\delta^k .
$$
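As a hedged illustration (added here; the original text does not specify the noise distribution), if $\delta^k$ is assumed Gaussian with zero mean and standard deviation $\sigma$, this integral is simply the standard normal cumulative distribution function evaluated at $-\gamma^k/\sigma$:

```python
import math

def misclassification_probability(gamma, sigma):
    """P(delta < -gamma) for zero-mean Gaussian noise of standard deviation
    sigma, i.e. the standard normal CDF evaluated at -gamma/sigma."""
    return 0.5 * math.erfc(gamma / (sigma * math.sqrt(2.0)))

# Hypothetical values: a well-classified example (gamma > 0) is flipped only
# by large noise, while a negative stability is flipped more often than not.
print(misclassification_probability(1.0, 0.5))   # ~0.023
print(misclassification_probability(-0.5, 0.5))  # ~0.84
```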