For $d = 2$ we get a decision surface in three-dimensional space, which for a linear activation function is a plane (the linear discriminant of (3.28)) and for a squashing function an S-folded plane; the decision border is always a line.
In the present section we analyze the perceptron as a regression-like machine: the learning algorithm iteratively drives the weights such that the continuous output $y_i = \varphi(w^T x_i + w_0)$ approximates the target value $t_i$ for every $x_i$. The error r.v., whose instantiations are $e_i = t_i - y_i$, is therefore a continuous random variable.
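For concreteness, a single forward pass and error computation can be written in a few lines. The following is a minimal sketch with arbitrary placeholder weights, taking $\varphi = \tanh$ as one example of a squashing function:

```python
import numpy as np

w, w0 = np.array([0.8, -0.4]), 0.1     # placeholder weights (arbitrary)
x_i, t_i = np.array([1.5, 0.5]), 1.0   # one input instance and its target

y_i = np.tanh(w @ x_i + w0)            # continuous perceptron output
e_i = t_i - y_i                        # one instantiation of the error r.v.
print(y_i, e_i)
```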
In order to derive analytical expressions of the theoretical EEs for the
perceptron (and other machines as well), one is compelled to apply transfor-
mations of the input distributions. The well-known theorem of univariate r.v.
transformation (see e.g., [183]) is enough for our purposes:
Theorem 3.2. Let $f(x)$ be the PDF of the r.v. $X$. Assume $\varphi(x)$ to be a monotonic and differentiable function. If $g(y)$ is the PDF of $Y = \varphi(X)$ and $\varphi'(x) \neq 0$, $\forall x \in X$, then

$$g(y) = \begin{cases} \dfrac{f\left(\varphi^{-1}(y)\right)}{\left|\varphi'\left(\varphi^{-1}(y)\right)\right|}, & \inf \varphi(x) < y < \sup \varphi(x) \\[6pt] 0, & \text{otherwise} \end{cases} \qquad (3.41)$$

where $x = \varphi^{-1}(y)$ is the inverse function of $y = \varphi(x)$.
Note that sigmoidal activation functions satisfy the conditions of the theorem.
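As a quick numerical illustration of (3.41), consider $\varphi = \tanh$ and a standard Gaussian $X$; then $\varphi'(x) = 1 - \tanh^2(x) = 1 - y^2$ and $g(y) = f(\operatorname{arctanh} y)/(1 - y^2)$ on $(-1, 1)$. The following Python sketch (our own illustration, with an arbitrary sample size and bin grid) checks this expression against a histogram of transformed samples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500_000)
y = np.tanh(x)                              # Y = phi(X)

def f(x):                                   # PDF of X ~ N(0, 1)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def g(y):                                   # Theorem 3.2 applied to phi = tanh
    x_inv = np.arctanh(y)                   # phi^{-1}(y)
    return f(x_inv) / (1 - y**2)            # |phi'(phi^{-1}(y))| = 1 - y^2

hist, edges = np.histogram(y, bins=50, range=(-0.99, 0.99), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - g(centers))))    # small: empirical matches analytic
```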
3.3.1 Motivational Examples
Two examples are now presented which constitute a good illustration of how the Shannon EE-risk perceptron, with $\varphi(\cdot) = \tanh(\cdot)$ and $T = \{-1, 1\}$, performs when applied to two-class, two-dimensional datasets. They involve artificial data for which the $\min P_e$ values are known. We leave the application to real-world datasets to a later section. We will also use these examples to discuss some convergence issues.
Example 3.7. In this example we use two Gaussian distributed class-conditional PDFs, $g(x; \mu_t, \Sigma_t)$, with $\mu_{-1} = [0\ \ 0]^T$, $\Sigma_{-1} = I$, and $\mu_1 = [1.5\ \ 0.5]^T$,

$$\Sigma_1 = \begin{bmatrix} 1.1 & 0.3 \\ 0.3 & 1.5 \end{bmatrix}.$$
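Since both class-conditional PDFs are fully specified, the minimum attainable error probability can be estimated numerically. The following Monte Carlo sketch, merely illustrative and not the book's computation, classifies each sample by the larger class-conditional likelihood (assuming equal priors) and counts the misclassifications:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
g_n = multivariate_normal([0.0, 0.0], np.eye(2))                  # class -1
g_p = multivariate_normal([1.5, 0.5], [[1.1, 0.3], [0.3, 1.5]])   # class 1

m = 200_000                                      # Monte Carlo sample size
x_n = g_n.rvs(m, random_state=rng)
x_p = g_p.rvs(m, random_state=rng)

# With equal priors the Bayes rule picks the larger likelihood; an error
# occurs whenever the other class's likelihood wins on a sample.
pe = 0.5 * np.mean(g_p.pdf(x_n) > g_n.pdf(x_n)) \
   + 0.5 * np.mean(g_n.pdf(x_p) > g_p.pdf(x_p))
print(f"estimated min Pe ~ {pe:.4f}")
```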
Let us consider 300-instance training and test datasets (150 instances per class) and apply the (Shannon) MEE gradient descent algorithm with $h = 1$ and $\eta = 0.001$. Note that, according to formula (E.19), the optimal bandwidth for $n = 150$ is $h_{\text{IMSE}} = 0.39$. We are, therefore, using fat estimation of the error PDF.
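A minimal NumPy sketch of this experiment is given below. It performs batch gradient descent on the Parzen-window estimate of the Shannon error entropy, $\hat{H} = -\frac{1}{n}\sum_i \ln \hat{f}(e_i)$, with Gaussian kernels of bandwidth $h = 1$ and $\eta = 0.001$ as above; the epoch count, the weight initialization, and the batch (rather than sample-by-sample) update are our own assumptions, not prescriptions of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the two Gaussian classes of Example 3.7 (150 instances per class).
n = 150
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], np.eye(2), n),
    rng.multivariate_normal([1.5, 0.5], [[1.1, 0.3], [0.3, 1.5]], n),
])
t = np.hstack([-np.ones(n), np.ones(n)])         # targets T = {-1, 1}
Xb = np.hstack([np.ones((2 * n, 1)), X])         # prepend bias input

h, eta, epochs = 1.0, 0.001, 400                 # epochs: assumed value
w = rng.normal(scale=0.1, size=3)                # assumed initialization

for _ in range(epochs):
    y = np.tanh(Xb @ w)
    e = t - y                                    # error instantiations e_i
    de = -(1.0 - y**2)[:, None] * Xb             # d e_i / d w
    D = e[:, None] - e[None, :]                  # pairwise e_i - e_j
    K = np.exp(-D**2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))
    f_hat = K.mean(axis=1)                       # Parzen estimate of f(e_i)
    A = K * (-D / h**2)                          # Gaussian kernel derivative
    # d f(e_i)/dw = (1/N) sum_j A_ij (de_i/dw - de_j/dw)
    df = (A.sum(axis=1)[:, None] * de - A @ de) / len(e)
    gH = -(df / f_hat[:, None]).mean(axis=0)     # gradient of Shannon EE
    w -= eta * gH                                # minimize the error entropy

train_err = np.mean(np.sign(np.tanh(Xb @ w)) != t)
print(f"training error: {train_err:.3f}")
```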