MEE with Continuous Errors - Minimum Error Entropy Classification


and

$$f_{E|t}(e) = f_{Y|t}(t - e) = g\left(e;\; t - \mathbf{w}^T \boldsymbol{\mu}_{X|t} - w_0,\; \sqrt{\mathbf{w}^T \Sigma_{X|t} \mathbf{w}}\right). \tag{3.31}$$

We now proceed to compute the information potential as in Example 3.1:

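$$\int_{-\infty}^{+\infty} f_{E|t}^2(e)\, de = g\left(0;\, 0,\, \sqrt{2}\,\sigma_{Y|t}\right) \quad \text{and} \quad \int_{-\infty}^{+\infty} f_{E|-1}(e)\, f_{E|1}(e)\, de = g\left(0;\, 2 - d,\, \sqrt{2}\,\sigma_m\right), \tag{3.32}$$

with $\sigma_{Y|t} = \sqrt{\mathbf{w}^T \Sigma_{X|t} \mathbf{w}}$, $\sigma_m = \sqrt{(\sigma_{Y|-1}^2 + \sigma_{Y|1}^2)/2}$, and $d = \mathbf{w}^T (\boldsymbol{\mu}_{X|1} - \boldsymbol{\mu}_{X|-1})$.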

Hence, for equal priors:

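$$V_{R_2} \equiv V_{R_2}(d, \sigma_{Y|-1}, \sigma_{Y|1}) = \frac{1}{8\sqrt{\pi}} \left[ \frac{1}{\sigma_{Y|-1}} + \frac{1}{\sigma_{Y|1}} + \frac{2}{\sigma_m} \exp\left( -\frac{(2 - d)^2}{4\sigma_m^2} \right) \right]. \tag{3.33}$$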
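As a sanity check on (3.32) and (3.33), the closed form can be compared with a direct numerical integration of the squared error PDF built from (3.31). The sketch below assumes equal priors, targets $t \in \{-1, 1\}$, and $g$ parameterized by its standard deviation; the function names and the parameter values are illustrative choices, not taken from the text.

```python
import numpy as np

def gauss(x, mean, std):
    """Gaussian PDF g(x; mean, std), with std the standard deviation."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def vr2_closed_form(d, s_neg, s_pos):
    """Information potential for equal priors, following the form of (3.33)."""
    s_m = np.sqrt(0.5 * (s_neg ** 2 + s_pos ** 2))
    cross = (2.0 / s_m) * np.exp(-((2.0 - d) ** 2) / (4.0 * s_m ** 2))
    return (1.0 / s_neg + 1.0 / s_pos + cross) / (8.0 * np.sqrt(np.pi))

def vr2_numeric(w, w0, mu_neg, mu_pos, cov_neg, cov_pos, n_grid=200001):
    """Integrate f_E(e)^2 on a fine grid, with f_E the equal-prior mixture of (3.31)."""
    s_neg = np.sqrt(w @ cov_neg @ w)
    s_pos = np.sqrt(w @ cov_pos @ w)
    m_neg = -1.0 - w @ mu_neg - w0   # mean of E given t = -1
    m_pos = 1.0 - w @ mu_pos - w0    # mean of E given t = +1
    lo = min(m_neg - 10.0 * s_neg, m_pos - 10.0 * s_pos)
    hi = max(m_neg + 10.0 * s_neg, m_pos + 10.0 * s_pos)
    e = np.linspace(lo, hi, n_grid)
    f = 0.5 * gauss(e, m_neg, s_neg) + 0.5 * gauss(e, m_pos, s_pos)
    # Plain Riemann sum; the integrand is negligible at both ends of the grid.
    return np.sum(f ** 2) * (e[1] - e[0])

# Illustrative (hypothetical) parameter values.
w = np.array([0.5, -0.3])
mu_neg = np.array([0.0, 0.0])
mu_pos = np.array([2.0, 0.0])
cov = np.eye(2)

d = w @ (mu_pos - mu_neg)
s = np.sqrt(w @ cov @ w)
v_closed = vr2_closed_form(d, s, s)
v_num_a = vr2_numeric(w, 0.7, mu_neg, mu_pos, cov, cov)
v_num_b = vr2_numeric(w, -3.0, mu_neg, mu_pos, cov, cov)
print(v_closed, v_num_a, v_num_b)
```

The two numerical values use different $w_0$ and should agree with each other and with the closed form up to integration error, since $w_0$ does not appear in (3.33).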

It is clear that Rényi's quadratic entropy does not depend on $w_0$. This is a direct consequence of the invariance of entropy to translations, since from (3.31) we observe that, for equal priors,

$$\mathbb{E}[E] = -\tfrac{1}{2}\mathbf{w}^T \boldsymbol{\mu}_{X|-1} - \tfrac{1}{2}\mathbf{w}^T \boldsymbol{\mu}_{X|1} - w_0 ,$$

so that $w_0$ only shifts the error PDF. Shannon's entropy and α-order Rényi's entropies are insensitive to the constant $w_0$ term.

Things are different, however, when a linear classifier is trained with gradient descent using empirical entropies. Away from the convergent solution, the $e_i - e_j$ deviations in formula (3.3) are scattered, and the estimate $\hat{f}(e)$ doesn't usually reproduce well a sum of Gaussians with the above mean value. As a consequence, the bias term of the solution will undergo adjustments. Near the convergent solution, with the $e_i - e_j$ deviations crowding a small interval, the $\hat{f}(e)$ estimate provides a close approximation of the theoretical error PDF and the insensitivity to bias adjustments plays its role.
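The translation invariance behind the theoretical insensitivity to $w_0$ can be written out with a change of variable. This is the standard argument, sketched here with $c$ denoting an arbitrary constant shift and $V_\alpha$ the α-order information potential (both symbols introduced only for this illustration):

```latex
% Shifting the error by a constant c shifts its PDF: f_{E+c}(e) = f_E(e - c).
\begin{align*}
H_S(E + c) &= -\int_{-\infty}^{+\infty} f_E(e - c)\,\ln f_E(e - c)\, de
            = -\int_{-\infty}^{+\infty} f_E(u)\,\ln f_E(u)\, du = H_S(E), \\
V_\alpha(E + c) &= \int_{-\infty}^{+\infty} f_E^\alpha(e - c)\, de
            = \int_{-\infty}^{+\infty} f_E^\alpha(u)\, du = V_\alpha(E), \\
H_{R_\alpha}(E + c) &= \frac{1}{1 - \alpha}\,\ln V_\alpha(E + c) = H_{R_\alpha}(E).
\end{align*}
```

Taking $c = -w_0$ shows that changing the bias leaves both entropies of the theoretical error PDF unchanged.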

This empirical MEE behavior is illustrated in the following bivariate two-class example, where Shannon's entropy gradient descent is used.

Example 3.3. Consider two normally distributed class-conditional PDFs, $g(\mathbf{x}; \boldsymbol{\mu}_t, \Sigma_t)$, with $\boldsymbol{\mu}_{-1} = [0\ 0]^T$, $\boldsymbol{\mu}_1 = [2\ 0]^T$, $\Sigma_{-1} = \Sigma_1 = I$. Independent training and test datasets with $n = 250$ instances (125 instances per class) were generated and the Shannon MEE algorithm applied with $h = 1$ and $\eta = 0.001$ (see footnote 2). Note that according to formula (E.19) the optimal bandwidth for the number of instances being used is $h_{IMSE} = 0.4$. We are, therefore, using fat estimation of the error PDF, that is, a bandwidth larger than the IMSE-optimal one.
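The setup of this example can be sketched in a few lines. The code below generates one dataset with the stated class means and identity covariances, and evaluates a Parzen-window Shannon entropy estimate of the errors with a Gaussian kernel and $h = 1$. The estimator is a common resubstitution form assumed here for illustration; it may differ in detail from the book's formula, and the training loop with its adaptive $\eta$ rule is not reproduced. The function names, the seed, and the two weight settings are hypothetical.

```python
import numpy as np

def make_dataset(n_per_class=125, seed=0):
    """Draw n_per_class points per class from N([0,0], I) and N([2,0], I)."""
    rng = np.random.default_rng(seed)
    x_neg = rng.standard_normal((n_per_class, 2)) + np.array([0.0, 0.0])
    x_pos = rng.standard_normal((n_per_class, 2)) + np.array([2.0, 0.0])
    x = np.vstack([x_neg, x_pos])
    t = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return x, t

def linear_errors(w, w0, x, t):
    """Errors e = t - y of the linear discriminant y = w^T x + w0."""
    return t - (x @ w + w0)

def shannon_entropy_estimate(errors, h=1.0):
    """Assumed resubstitution estimate: -mean(log f_hat(e_i)), Gaussian kernel."""
    diffs = errors[:, None] - errors[None, :]          # all e_i - e_j deviations
    kernel = np.exp(-0.5 * (diffs / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    f_hat = kernel.mean(axis=1)                        # Parzen estimate at each e_i
    return -np.mean(np.log(f_hat))

x, t = make_dataset()

# Two hypothetical weight settings: one aligned with the class means, one not.
e_aligned = linear_errors(np.array([1.0, 0.0]), -1.0, x, t)
e_orthogonal = linear_errors(np.array([0.0, 1.0]), 0.0, x, t)

h_aligned = shannon_entropy_estimate(e_aligned, h=1.0)
h_orthogonal = shannon_entropy_estimate(e_orthogonal, h=1.0)
print(h_aligned, h_orthogonal)
```

With the weight vector aligned with the direction separating the class means, the errors of both classes share the same distribution, so the estimated entropy should come out lower than for the orthogonal direction, where the error PDF is a two-component mixture.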

2 From now on, the indicated η values are initial values of an adaptive rule to be described in Sect. 6.1.1.

