In this case, the classifier outputs are estimates of the posterior probabilities of the classes given the input data (the probability that the input pattern x belongs to class ω_k, coded as t_k). The classifier (whether a neural network or a classifier of another type) is then able to attain the optimal Bayes error performance.
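As a compact reminder of why this holds, the standard result for 0/1 target coding can be written as follows (the symbols y_k(x) for the k-th classifier output and ω_{k*} for the decided class are our notation):

```latex
% Minimizing the MSE risk with 0/1 target coding yields outputs equal to
% the conditional expectation of the targets, i.e. the posterior
% probabilities; choosing the class with the largest output then
% reproduces the Bayes decision rule (minimum probability of error).
\[
  y_k(\mathbf{x}) = E[\,T_k \mid \mathbf{x}\,] = P(\omega_k \mid \mathbf{x}),
  \qquad
  \hat{\omega}(\mathbf{x}) = \omega_{k^*},
  \quad k^* = \arg\max_k \, y_k(\mathbf{x}).
\]
```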
This result for L_SE can in fact be generalized to other loss functions L(t, y), as long as they satisfy the following three conditions [193]: a) L(t, y) = 0 iff t = y; b) L(t, y) > 0 if t ≠ y; c) L(t, y) is twice continuously differentiable.
In general practice, however, this ideal scenario is far from being met, for the following main reasons:
1. The classifier must be able to provide a good approximation of the conditional expectations E[T_k | x]. This may imply a more complex classifier architecture (e.g., more hidden neurons in the case of MLPs) than is adequate for good generalization of its performance.
2. The training algorithm must be able to reach the minimum of R_MSE. This is a thorny issue, since one will never know whether the training process converged to a global minimum or to a local minimum instead.
3. For simple artificial problems it may be possible to generate enough data instances so that one is sure to be near the asymptotic result (2.8), corresponding to infinite instances. One would then obtain good estimates of the posterior probabilities [185, 78]. However, in non-trivial classification problems with real-world datasets, one may be operating far away from the convergence solution (the numerical sketch after this list illustrates this point).
4. Finally, the above results have been derived under the assumption of noise-free data. In normal practice, however, one may expect some amount of noise in both the data and the target values. To give an example, when a supervisor has to label instances near the decision surface separating the various classes, it is not uncommon that some mistakes creep in.
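The following Python sketch gives a rough numerical illustration of points 1 and 3: a linear-in-parameters model with Gaussian basis functions is fitted by least squares (the MSE risk) to a hypothetical one-dimensional two-class problem, and its outputs are compared with the true posterior probabilities for a small and a large sample. The problem, the model and all settings are assumptions of this sketch, not taken from the text.

```python
# Illustrative sketch (all settings are assumptions): one-dimensional
# two-class problem with Gaussian class-conditional densities and equal
# priors; a linear-in-parameters model with Gaussian basis functions is
# fitted by least squares (MSE risk) and compared with the true posterior.
import numpy as np

rng = np.random.default_rng(0)

def sample(n_per_class):
    """Class 0 ~ N(-1, 1), class 1 ~ N(+1, 1); targets coded 0/1."""
    x = np.concatenate([rng.normal(-1.0, 1.0, n_per_class),
                        rng.normal(+1.0, 1.0, n_per_class)])
    t = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return x, t

def design(x, centers, width=1.0):
    """Bias column plus Gaussian radial basis features."""
    phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.column_stack([np.ones_like(x), phi])

def true_posterior(x):
    """Exact P(omega_1 | x) for the assumed densities and equal priors."""
    p0 = np.exp(-0.5 * (x + 1.0) ** 2)
    p1 = np.exp(-0.5 * (x - 1.0) ** 2)
    return p1 / (p0 + p1)

centers = np.linspace(-2.0, 2.0, 5)
grid = np.linspace(-3.0, 3.0, 301)

for n in (10, 10000):                      # small vs. large sample
    x, t = sample(n)
    # Least-squares (MSE) solution of the regression problem.
    w, *_ = np.linalg.lstsq(design(x, centers), t, rcond=None)
    y = design(grid, centers) @ w          # classifier output over the grid
    dev = np.mean(np.abs(y - true_posterior(grid)))
    print(f"n per class = {n:5d}   mean |output - posterior| = {dev:.3f}")
```

Typically the large-sample fit tracks P(ω_1 | x) closely, while the small-sample fit deviates markedly despite a low training MSE, in line with points 1 and 3 above.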
We now present two simple examples. The first one illustrates the drastic influence that a small sample size, or a tiny amount of input noise, may have on finding the min P_e solution with MSE. The second one illustrates the fact that a very small deviation of the posterior probability values may, nonetheless, provoke an important change in the min P_e value. This is a consequence of the integral nature of P_e, and should caution us as to the importance of the “posterior probability approximation” criterion.
Example 2.1. The dataset (inspired by an illustration in [41]) in this two-class example with T = {0, 1} is shown in Fig. 2.1; it corresponds to uniform distributions for both classes: the x_1-x_2 domain is [−3, −0.05] × [0, 0.15] ∪ [−0.5, −0.05] × [0.15, 1] for class 0, and reflected around the (0, 0.5) point for class 1.
Let us assume a regression-like classifier implementing the thresholded linear family F_W = {θ(f_w(x))}, with f_w(x) = w_0 + w_1 x_1 + w_2 x_2 and θ(y) = h(y + 0.5), where h is the Heaviside step function. Using the MSE risk, one may apply the direct parameter estimation algorithm, amounting to solving the normal equations for the linear regression problem. Once w = (w_0, w_1, w_2),
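As a rough complement, the following Python sketch generates a sample from the class distributions described above and solves the normal equations for the linear family. The sample size, the random seed, and the placement of the decision threshold at 0.5 (the midpoint of the 0/1 targets) are assumptions of this sketch, not taken from the text.

```python
# Sketch of the setup of Example 2.1 (illustrative; sample size, seed and
# the 0.5 decision threshold are assumptions of this sketch).
import numpy as np

rng = np.random.default_rng(1)

def sample_class0(n):
    """Uniform over [-3, -0.05] x [0, 0.15]  union  [-0.5, -0.05] x [0.15, 1]."""
    # Areas of the two rectangles, used to pick one proportionally.
    a1 = (3 - 0.05) * 0.15
    a2 = (0.5 - 0.05) * (1 - 0.15)
    pick = rng.random(n) < a1 / (a1 + a2)
    x1 = np.where(pick, rng.uniform(-3, -0.05, n), rng.uniform(-0.5, -0.05, n))
    x2 = np.where(pick, rng.uniform(0, 0.15, n), rng.uniform(0.15, 1, n))
    return np.column_stack([x1, x2])

def sample_class1(n):
    """Class 1 is class 0 reflected around the point (0, 0.5)."""
    return -sample_class0(n) + np.array([0.0, 1.0])

n = 500                                          # instances per class (assumed)
X = np.vstack([sample_class0(n), sample_class1(n)])
t = np.concatenate([np.zeros(n), np.ones(n)])    # targets T = {0, 1}

# Solve the normal equations for f_w(x) = w0 + w1*x1 + w2*x2 (MSE risk).
Phi = np.column_stack([np.ones(2 * n), X])
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Thresholded decision; the 0.5 threshold (midpoint of the 0/1 targets)
# is an assumption of this sketch.
y = (Phi @ w >= 0.5).astype(float)
print("w =", w, "  training error rate =", np.mean(y != t))
```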