In this case, the classifier outputs are estimates of the posterior probabilities of the classes given the input data (the probability that the input pattern x belongs to class ω_k, coded as t_k). The classifier (whether a neural network or a classifier of another type) is then able to attain the optimal Bayes error performance.
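As a compact reminder of why this holds, the standard result for 0/1 target coding can be written as follows (the symbols y_k(x) for the k-th classifier output and ω_{k*} for the decided class are our notation):

```latex
% Minimizing the MSE risk with 0/1 target coding yields outputs equal to
% the conditional expectation of the targets, i.e. the posterior
% probabilities; choosing the class with the largest output then
% reproduces the Bayes decision rule (minimum probability of error).
\[
  y_k(\mathbf{x}) = E[\,T_k \mid \mathbf{x}\,] = P(\omega_k \mid \mathbf{x}),
  \qquad
  \hat{\omega}(\mathbf{x}) = \omega_{k^*},
  \quad k^* = \arg\max_k \, y_k(\mathbf{x}).
\]
```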
This result for L_SE can in fact be generalized to other loss functions L(t, y), as long as they satisfy the following three conditions [193]: a) L(t, y) = 0 iff t = y; b) L(t, y) > 0 if t ≠ y; c) L(t, y) is twice continuously differentiable.
In general practice, however, this ideal scenario is far from being met, for the following main reasons:
1. The classifier must be able to provide a good approximation of the conditional expectations E[T_k | x]. This may imply a more complex classifier architecture (e.g., more hidden neurons in the case of MLPs) than is adequate for good generalization of its performance.
2. The training algorithm must be able to reach the minimum of R_MSE. This is a thorny issue, since one will never know whether the training process converged to a global minimum or to a local minimum instead.
3. For simple artificial problems it may be possible to generate enough data instances so that one is sure to be near the asymptotic result (2.8), corresponding to infinite instances. One would then obtain good estimates of the posterior probabilities [185, 78]. However, in non-trivial classification problems with real-world datasets, one may be operating far away from the convergence solution (the numerical sketch after this list illustrates this point).
4. Finally, the above results have been derived under the assumption of noise-free data. In normal practice, however, one may expect some amount of noise in both the data and the target values. To give an example, when a supervisor has to label instances near the decision surface separating the various classes, it is not uncommon that some mistakes creep in.
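The following Python sketch gives a rough numerical illustration of points 1 and 3: a linear-in-parameters model with Gaussian basis functions is fitted by least squares (the MSE risk) to a hypothetical one-dimensional two-class problem, and its outputs are compared with the true posterior probabilities for a small and a large sample. The problem, the model and all settings are assumptions of this sketch, not taken from the text.

```python
# Illustrative sketch (all settings are assumptions): one-dimensional
# two-class problem with Gaussian class-conditional densities and equal
# priors; a linear-in-parameters model with Gaussian basis functions is
# fitted by least squares (MSE risk) and compared with the true posterior.
import numpy as np

rng = np.random.default_rng(0)

def sample(n_per_class):
    """Class 0 ~ N(-1, 1), class 1 ~ N(+1, 1); targets coded 0/1."""
    x = np.concatenate([rng.normal(-1.0, 1.0, n_per_class),
                        rng.normal(+1.0, 1.0, n_per_class)])
    t = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return x, t

def design(x, centers, width=1.0):
    """Bias column plus Gaussian radial basis features."""
    phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.column_stack([np.ones_like(x), phi])

def true_posterior(x):
    """Exact P(omega_1 | x) for the assumed densities and equal priors."""
    p0 = np.exp(-0.5 * (x + 1.0) ** 2)
    p1 = np.exp(-0.5 * (x - 1.0) ** 2)
    return p1 / (p0 + p1)

centers = np.linspace(-2.0, 2.0, 5)
grid = np.linspace(-3.0, 3.0, 301)

for n in (10, 10000):                      # small vs. large sample
    x, t = sample(n)
    # Least-squares (MSE) solution of the regression problem.
    w, *_ = np.linalg.lstsq(design(x, centers), t, rcond=None)
    y = design(grid, centers) @ w          # classifier output over the grid
    dev = np.mean(np.abs(y - true_posterior(grid)))
    print(f"n per class = {n:5d}   mean |output - posterior| = {dev:.3f}")
```

Typically the large-sample fit tracks P(ω_1 | x) closely, while the small-sample fit deviates markedly despite a low training MSE, in line with points 1 and 3 above.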
We now present two simple examples. The first one illustrates the drastic influence that a small sample size, or a tiny amount of input noise, may have on finding the min P_e solution with MSE. The second one illustrates the fact that a very small deviation of the posterior probability values may, nonetheless, provoke an important change in the min P_e value. This is a consequence of the integral nature of P_e, and should caution us as to the importance of the “posterior probability approximation” criterion.
Example 2.1. The dataset (inspired by an illustration in [41]) in this two-class example with T = {0, 1} is shown in Fig. 2.1; it corresponds to uniform distributions for both classes: the x_1-x_2 domain is [−3, −0.05] × [0, 0.15] ∪ [−0.5, −0.05] × [0.15, 1] for class 0, and reflected around the (0, 0.5) point for class 1.
Let us assume a regression-like classifier implementing the thresholded linear family F_W = {θ(f_w(x))}, with f_w(x) = w_0 + w_1 x_1 + w_2 x_2 and θ(y) = h(y + 0.5), where h is the Heaviside step function. Using the MSE risk, one may apply the direct parameter estimation algorithm, amounting to solving the normal equations for the linear regression problem. Once w = (w_0, w_1, w_2),
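As a rough complement, the following Python sketch generates a sample from the class distributions described above and solves the normal equations for the linear family. The sample size, the random seed, and the placement of the decision threshold at 0.5 (the midpoint of the 0/1 targets) are assumptions of this sketch, not taken from the text.

```python
# Sketch of the setup of Example 2.1 (illustrative; sample size, seed and
# the 0.5 decision threshold are assumptions of this sketch).
import numpy as np

rng = np.random.default_rng(1)

def sample_class0(n):
    """Uniform over [-3, -0.05] x [0, 0.15]  union  [-0.5, -0.05] x [0.15, 1]."""
    # Areas of the two rectangles, used to pick one proportionally.
    a1 = (3 - 0.05) * 0.15
    a2 = (0.5 - 0.05) * (1 - 0.15)
    pick = rng.random(n) < a1 / (a1 + a2)
    x1 = np.where(pick, rng.uniform(-3, -0.05, n), rng.uniform(-0.5, -0.05, n))
    x2 = np.where(pick, rng.uniform(0, 0.15, n), rng.uniform(0.15, 1, n))
    return np.column_stack([x1, x2])

def sample_class1(n):
    """Class 1 is class 0 reflected around the point (0, 0.5)."""
    return -sample_class0(n) + np.array([0.0, 1.0])

n = 500                                          # instances per class (assumed)
X = np.vstack([sample_class0(n), sample_class1(n)])
t = np.concatenate([np.zeros(n), np.ones(n)])    # targets T = {0, 1}

# Solve the normal equations for f_w(x) = w0 + w1*x1 + w2*x2 (MSE risk).
Phi = np.column_stack([np.ones(2 * n), X])
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Thresholded decision; the 0.5 threshold (midpoint of the 0/1 targets)
# is an assumption of this sketch.
y = (Phi @ w >= 0.5).astype(float)
print("w =", w, "  training error rate =", np.mean(y != t))
```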