(A coding scheme such as $T = \{1, \ldots, c\}$ turns out to be uncomfortable to deal with.) For instance, for $c = 4$ the classifier would have four outputs, $\{z_1, z_2, z_3, z_4\}$, and class $\omega_2$, say, would be represented by $[0\ 1\ 0\ 0]^T$ or $[-1\ 1\ -1\ -1]^T$; i.e., with $z_2$ discriminating $\omega_2$ from its complement. We then see that the 1-of-$c$ coding scheme allows the classification task to be analyzed as a set of dichotomizations, whereby each class is discriminated from its complement. It is not surprising, therefore, that in the following chapters we concentrate on two-class classification problems; the results carry over directly to multi-class problems.
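To make the coding concrete, here is a minimal NumPy sketch (the helper name `one_of_c` is ours, purely illustrative) that produces the 1-of-$c$ targets and the $\{-1, 1\}$ variant:

```python
import numpy as np

def one_of_c(labels, c):
    """Encode integer class labels 1..c as 1-of-c (one-hot) target vectors."""
    t = np.zeros((len(labels), c))
    t[np.arange(len(labels)), np.asarray(labels) - 1] = 1.0
    return t

# Class omega_2 with c = 4 is coded as [0 1 0 0]; output z_2
# dichotomizes omega_2 from its complement.
t = one_of_c([2], c=4)
print(t)          # [[0. 1. 0. 0.]]
print(2 * t - 1)  # [[-1.  1. -1. -1.]] -- the {-1, 1} variant
```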
Classifiers such as several types of neural networks (multilayer perceptrons, for example) are regression-like machines: the classifier builds a continuous approximation to the targets, using some function $y_w$, which is subsequently thresholded in order to yield numerical values according to the coding scheme. For instance, when neural network outputs are produced by sigmoid units in the $[0, 1]$ range, one may set a threshold at 0.5; an output larger than 0.5 is assigned the class label corresponding to $T = 1$ (say, $\omega_2$), otherwise to its complement ($\bar{\omega}_2$), corresponding to $T = 0$. The function
$y_w$ can be interpreted in a geometrical sense, with its $X$-$y_w(X)$ graph representing a decision surface that separates (discriminates) instances from distinct classes. The classifier mapping is $z_w : X \to T$, with $z_w = \theta_w(y_w)$ and $\theta_w$ an appropriate thresholding function.
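The sketch below illustrates this composition under the simplest assumptions: a single sigmoid unit with hypothetical weights `w` and bias `b` (both ours, for illustration), thresholded at 0.5 to yield the $T = \{0, 1\}$ coding:

```python
import numpy as np

def y_w(x, w, b):
    """Continuous, regression-like output: a sigmoid unit with range [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def theta(y, threshold=0.5):
    """Thresholding function mapping the continuous output onto T = {0, 1}."""
    return (y > threshold).astype(int)

def z_w(x, w, b):
    """Classifier mapping z_w : X -> T, the composition theta(y_w(x))."""
    return theta(y_w(x, w, b))

# Hypothetical weights: points with x1 + x2 > 1 get T = 1 (omega_2).
x = np.array([[0.2, 0.1], [0.9, 0.8]])
w, b = np.array([1.0, 1.0]), -1.0
print(y_w(x, w, b))  # continuous outputs, approx. [0.33 0.67]
print(z_w(x, w, b))  # [0 1] -- second point assigned to omega_2
```

In this toy setting the zero level set of $x_1 + x_2 - 1$ plays the role of the decision surface mentioned above.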
Neural networks and decision trees are used extensively in this book to illustrate the minimum error entropy approach to supervised classification. There is an abundant literature on theoretical issues, as well as on architectures and design methods, of neural networks and decision trees. Relevant references on neural networks are [26, 27, 142, 96]; for decision trees, see [33] and [177]. Supervised and unsupervised classifiers, especially with a focus on Pattern Recognition applications, are presented in detail in [59] and [225]. A main reference on theoretical issues concerning classifier operation and performance is [52].
1.2 Data-Based Learning
For a classifier whose design is based on a set $(X_{ds}, T_{ds}) = \{(x_i, t(x_i));\ i = 1, 2, \ldots, n\}$, some type of approximation of the output $z_w(x)$ to $t(x)$, for all $x \in X$, is required. Of all possible approximation criteria one could think of, one clearly stands out, addressing the mismatch between observed and predicted class labels: how close does the classifier come to achieving the minimum probability of (classification) error, $P_e$, in $X \times T$? We are then interested in the efficacy of the classifier in the population, $X \times T$.
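Since only the design set is available in practice, the empirical error rate on $(X_{ds}, T_{ds})$ is the natural surrogate for $P_e$. A minimal sketch (the helper name `empirical_error` is ours):

```python
import numpy as np

def empirical_error(z, t):
    """Design-set error rate: fraction of instances with z_w(x_i) != t(x_i).

    Only an estimate of the true P_e, which refers to the whole
    population X x T rather than to the design set (X_ds, T_ds).
    """
    return float(np.mean(np.asarray(z) != np.asarray(t)))

# Hypothetical predicted labels z and targets t for n = 5 instances:
t = [0, 1, 1, 0, 1]
z = [0, 1, 0, 0, 1]
print(empirical_error(z, t))  # 0.2 -- one mismatch out of five
```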
Figure 1.1 illustrates a data-based classification system implementing $z_w \in Z_W$, where the parameter $w$ is tuned by a learning algorithm, built expressly for the design process alone. The learning algorithm performs an