Let us consider a classifier designed by empirical risk minimization using some $(X_{ds}, T_{ds})$ training set. We now denote by $n_d$ the number of $(X_{ds}, T_{ds})$ instances. The classifier error rate estimated by error counting in $(X_{ds}, T_{ds})$ (empirical error, formula (1.4)) is then denoted $P_{ed}(n_d)$. The generalization ability of the classifier is assessed by measuring its error rate in some $n_t$-sized test set, $(X_{ts}, T_{ts})$, assumed to be independently drawn from the same $F_{X,T}$ distribution; the respective error rate is $P_{et}(n_t)$. $P_{ed}(n_d)$ and $P_{et}(n_t)$ are single realizations of two r.v.s; we are then interested in the main statistics of these r.v.s, namely their expected values characterizing the average training-set and test-set behaviors of the classifier. Usually, there are no closed-form expressions of these expected values (an exception is the two-class setting with equal-variance Gaussian inputs; see [74]) and one has to use sample mean estimates, which we denote $\overline{P}_{ed}(n_d)$ and $\overline{P}_{et}(n_t)$, respectively.
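As a concrete illustration of these quantities, the sketch below computes single realizations of $P_{ed}(n_d)$ and $P_{et}(n_t)$ by error counting. The two-class Gaussian problem, the choice of scikit-learn's LinearDiscriminantAnalysis as classifier, and the set sizes are illustrative assumptions only, not prescribed by the text.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def draw_set(n):
    """Draw an n-sized sample from a two-class problem: X | T=t ~ N(2t, 1), equal priors."""
    t = rng.integers(0, 2, size=n)               # class labels T
    x = rng.normal(2.0 * t, 1.0).reshape(-1, 1)  # inputs X
    return x, t

n_d, n_t = 100, 10000
X_ds, T_ds = draw_set(n_d)                       # design (training) set
X_ts, T_ts = draw_set(n_t)                       # independent test set, same F_{X,T}

clf = LinearDiscriminantAnalysis().fit(X_ds, T_ds)

# Error counting: fraction of misclassified instances.
P_ed = np.mean(clf.predict(X_ds) != T_ds)        # training-set error rate, P_ed(n_d)
P_et = np.mean(clf.predict(X_ts) != T_ts)        # test-set error rate, P_et(n_t)
print(f"P_ed({n_d}) = {P_ed:.3f}   P_et({n_t}) = {P_et:.3f}")
```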
If, for some classifier designed with $n_d$ instances, one computes $\overline{P}_{et}(n_t)$ with a sufficiently large number of $(X_{ts}, T_{ts})$ sets s.t. $\overline{P}_{et}(n_t) \approx E[P_{et}(n_t)]$, then $\overline{P}_{et}(n_t)$ is also a close estimate of $P_e(n_d)$, the true error of the classifier (see e.g., [76]). On the other hand, for a sufficiently large number of $(X_{ds}, T_{ds})$ sets, $\overline{P}_{ed}(n_d) \approx E[P_{ed}(n_d)]$ will exhibit a bias, monotonically decreasing with $n_d$, relative to $\min P_e$, for consistent learning algorithms.
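This bias statement can be checked numerically. The sketch below reuses the same illustrative Gaussian problem (whose minimum error is $\min P_e = \Phi(-1) \approx 0.159$) and a linear discriminant classifier, averaging $P_{ed}(n_d)$ over many independently drawn design sets; none of these choices comes from the text.

```python
import numpy as np
from scipy.stats import norm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

def draw_set(n):
    t = rng.integers(0, 2, size=n)               # X | T=t ~ N(2t, 1), equal priors
    return rng.normal(2.0 * t, 1.0).reshape(-1, 1), t

min_Pe = norm.cdf(-1.0)                          # minimum (Bayes) error of this problem
for n_d in (20, 50, 100, 500, 2000):
    errs = []
    for _ in range(500):                         # 500 independent (X_ds, T_ds) sets
        X_ds, T_ds = draw_set(n_d)
        if len(np.unique(T_ds)) < 2:             # skip the rare one-class draw
            continue
        clf = LinearDiscriminantAnalysis().fit(X_ds, T_ds)
        errs.append(np.mean(clf.predict(X_ds) != T_ds))
    # The magnitude of the (optimistic) bias shrinks as n_d grows.
    print(f"n_d={n_d:5d}  mean P_ed={np.mean(errs):.3f}  bias={np.mean(errs) - min_Pe:+.3f}")
```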
Unfortunately, in the usual practice one does not have the possibility of drawing large numbers of independent training sets and test sets. Instead, one is given an $n$-sized $(X_{dt}, T_{dt})$ set which will be used for both classifier design and classifier evaluation. One is then compelled to partition $(X_{dt}, T_{dt})$ into training sets and test sets. There are a number of ways of randomly selecting $n_d$ and $n_t$ instances for the computation of $P_{ed}(n_d)$ and $P_{et}(n_t)$. The basic scheme we use throughout the book is the stratified $k$-fold cross-validation scheme, where the $n$-sized $(X_{dt}, T_{dt})$ dataset is first randomly partitioned into $k$ partitions of size $n_t = n/k$, such that the classes maintain their proportional representation (stratification). One is then able to carry out $k$ design experiments, each time using all the instances of the $(k-1)$ other partitions as design set and testing the classifier on the remaining partition. Training-set and test-set averages $(\overline{P}_{ed}(n_d), \overline{P}_{et}(n_t))$ and standard deviations $(s(P_{ed}(n_d)), s(P_{et}(n_t)))$ of the $k$ experiments are then computed. The standard deviations always decrease with the set cardinalities (as in the binomial approximation $P_e(1 - P_e)/n$). Note that the partition scheme invalidates the assumption of test sets being independent of training sets. The practical implications of this, with respect to what we said above about $\overline{P}_{et}(n_t)$ and $\overline{P}_{ed}(n_d)$, are not important for not too small $n$.
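A minimal sketch of the stratified $k$-fold scheme just described, computing the fold averages and standard deviations, is given below. It assumes scikit-learn's StratifiedKFold and, purely for illustration, a linear discriminant classifier and a small synthetic dataset standing in for $(X_{dt}, T_{dt})$.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stratified_kfold_errors(X, t, k=10, seed=0):
    """Mean and standard deviation of training-set and test-set error rates over k folds."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    P_ed, P_et = [], []
    for design_idx, test_idx in skf.split(X, t):
        clf = LinearDiscriminantAnalysis().fit(X[design_idx], t[design_idx])
        P_ed.append(np.mean(clf.predict(X[design_idx]) != t[design_idx]))  # P_ed(n_d), n_d = n(k-1)/k
        P_et.append(np.mean(clf.predict(X[test_idx]) != t[test_idx]))      # P_et(n_t), n_t = n/k
    return (np.mean(P_ed), np.std(P_ed, ddof=1)), (np.mean(P_et), np.std(P_et, ddof=1))

# Illustrative dataset playing the role of the n-sized (X_dt, T_dt).
rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=200)
X = rng.normal(2.0 * t, 1.0).reshape(-1, 1)
(ed_m, ed_s), (et_m, et_s) = stratified_kfold_errors(X, t, k=10)
print(f"design: {ed_m:.3f} (s={ed_s:.3f})   test: {et_m:.3f} (s={et_s:.3f})")
```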
Sometimes, the $k$-fold cross-validation scheme is repeated a certain number of times, each time with a $(X_{ds}, T_{ds})$ reshuffling, thereby improving the quality (confidence interval) of the estimates.
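For this repeated variant, scikit-learn offers RepeatedStratifiedKFold, which reshuffles the data before each repetition of the $k$-fold partition; a brief sketch follows, again with an illustrative classifier and synthetic data not taken from the text.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=200)
X = rng.normal(2.0 * t, 1.0).reshape(-1, 1)

# 10-fold CV repeated 20 times; each repetition uses a fresh reshuffling of the data.
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
P_et = [np.mean(LinearDiscriminantAnalysis().fit(X[d], t[d]).predict(X[ts]) != t[ts])
        for d, ts in rskf.split(X, t)]
print(f"P_et = {np.mean(P_et):.3f} (s = {np.std(P_et, ddof=1):.3f}) over {len(P_et)} folds")
```

Averaging over the 200 folds yields a tighter interval around the mean test error than a single 10-fold pass.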
The particular case of the cross-validation scheme with $k = 2$ is called the hold-out method. We use in the following examples a simpler version of the repeated hold-out method (simple hold-out) where half of the data is used
 