general formulation one has a learning machine whose algorithm attempts to minimize an approximation performance measure, with distinct flavors according to the problem type (see, e.g., [41]). In the present topic our sole concern is with classification problems.
Regarding the approximation performance of classifiers, several important
points are to be noted:
1. We usually don't know $F_{X,T}$, which renders the integration in formula (1.2) impossible, even in the improbable scenario that the other technicalities were easily solvable. In practical real-world problems, where one designs $z_w$ using the $n$-sized dataset $(X_{ds}, T_{ds})$, one can only obtain an estimate of $P_e$, the so-called empirical error estimate (or empirical error rate), which is simply the average number of misclassified instances (a computational sketch follows this list):
$$\hat{P}_e(n) = \frac{1}{n} \sum_{x \in X_{ds}} I_{\{t \neq z_w(x)\}} . \qquad (1.4)$$
2. The question then arises of how large the training set size, $n$, should be such that with a given confidence (say, 95% confidence) the empirical error rate (the observed error), $\hat{P}_e$, does not deviate from the probability of error (the true error), $P_e$, by more than a fixed tolerance. The answer to this question is not difficult for model-based classifiers: just use the available techniques of statistical estimation, namely the computation of confidence intervals (see the sketch following this list).
For data-based classifiers this generalization issue is rather more intricate. In both cases the analysis leads one to use a "functional complexity" of $Z_W$ adequate to the training set size. Intuitively, too much complexity of $Z_W$ will tend to produce complex decision surfaces that tightly discriminate the training set instances but will be unable to perform equally well (generalize) on new independent datasets (the over-fitting phenomenon); too little complexity will produce a classifier that generalizes well but with poor performance, failing to extract from the data all possible information (under-fitting). The present topic doesn't include a discussion of the generalization issue and related issues such as dimensionality reduction and function regularization. Theoretical and practical aspects of these topics are discussed in detail in [52,230,41,227,11,228]. For shorter but well-structured and informative surveys, see [30,233]. We will include a few remarks on these aspects when appropriate. We will also make use of the simple generalization measure $|\hat{P}_e - P_e|$, which is small if the classifier generalizes well (illustrated after this list). Note that generalization has nothing to do with $\hat{P}_e$ being small; it really means that $\hat{P}_e$ is a good estimate of $P_e$.
3. For a classifier architecture that implements $Z_W = \{ z_w : X \to T;\ w \in W \}$, what is really of interest to us is not some $z_w$ but the best of them all. In other words, one requires the learning algorithm to achieve the minimum probability of error in $Z_W$, expressed as follows:

   Find $w$ s.t. $P_e(z_w)$ is minimum.
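As a concrete aside on points 1 and 2, the following is a minimal sketch, in Python, of the empirical error estimate (1.4) together with a normal-approximation confidence interval for the true error $P_e$; the threshold classifier, the generated data, and all helper names are hypothetical illustrations, not constructs from this text.

    import numpy as np

    def empirical_error(z_w, X_ds, T_ds):
        # Empirical error rate (1.4): fraction of misclassified instances.
        predictions = np.array([z_w(x) for x in X_ds])
        return float(np.mean(predictions != np.asarray(T_ds)))

    def error_confidence_interval(p_hat, n, z=1.96):
        # Normal-approximation 95% confidence interval for the true error
        # P_e, treating the n classifications as Bernoulli trials (point 2).
        half_width = z * np.sqrt(p_hat * (1.0 - p_hat) / n)
        return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

    # Hypothetical example: a one-dimensional threshold classifier z_w.
    rng = np.random.default_rng(0)
    X_ds = rng.normal(size=200)                # inputs
    T_ds = (X_ds > 0).astype(int)              # targets
    z_w = lambda x: int(x > 0.1)               # classifier with parameter w = 0.1
    p_hat = empirical_error(z_w, X_ds, T_ds)
    low, high = error_confidence_interval(p_hat, len(X_ds))
    print(f"empirical error = {p_hat:.3f}, 95% CI for P_e = [{low:.3f}, {high:.3f}]")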
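The generalization measure $|\hat{P}_e - P_e|$ can itself be illustrated by comparing the design-set error with the error on a much larger independent sample, which stands in for the unavailable $P_e$; again a hedged sketch with entirely hypothetical data and names.

    import numpy as np

    def error_on(z_w, X, T):
        # Fraction of instances of (X, T) misclassified by z_w.
        return float(np.mean(np.array([z_w(x) for x in X]) != np.asarray(T)))

    rng = np.random.default_rng(2)
    z_w = lambda x: int(x > 0.1)               # same hypothetical classifier

    X_ds = rng.normal(size=50)                 # small n-sized design set
    T_ds = (X_ds > 0).astype(int)
    X_ind = rng.normal(size=100_000)           # large independent sample, whose
    T_ind = (X_ind > 0).astype(int)            # error rate approximates P_e

    gap = abs(error_on(z_w, X_ds, T_ds) - error_on(z_w, X_ind, T_ind))
    print(f"|Pe_hat - Pe| ~= {gap:.3f}")       # small gap: good generalization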
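Finally, the search in point 3 can be sketched for a hypothetical one-parameter family $Z_W$ of threshold classifiers: since $P_e$ itself is unavailable, the empirical estimate (1.4) stands in for it when picking the best $w$.

    import numpy as np

    def best_parameter(W_grid, X_ds, T_ds):
        # Return the w in the hypothetical family Z_W = {z_w(x) = [x > w]}
        # whose classifier attains the minimum empirical error on (X_ds, T_ds).
        errors = [float(np.mean((X_ds > w).astype(int) != T_ds)) for w in W_grid]
        return float(W_grid[int(np.argmin(errors))])

    rng = np.random.default_rng(1)
    X_ds = rng.normal(size=200)
    T_ds = (X_ds > 0).astype(int)
    w_best = best_parameter(np.linspace(-1.0, 1.0, 41), X_ds, T_ds)
    print(f"selected w = {w_best:.3f}")        # lands near the true threshold 0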