node, until meeting some stop criterion at a tree leaf. For a given tree archi-
tecture, the final solution is, as a consequence of the greedy approach, never
guaranteed to be close to z_w. Moreover, it turns out that the empirical error
is usually a poor choice of criterion for local decisions; instead, some other
type of risk criterion is used.
When carrying out risk optimization in regression-like machines, such as
multilayer perceptrons, one must not lose sight of the fact that the fun-
damental reason for attempting to optimize the risk expressed by formula
(1.7), instead of the special case of risk expressed by formula (1.2), is the
tractability of the former (efficient optimization algorithms for continuous
and differentiable loss functions) and the intractability of the latter. That
is, we consider R_L(Y_w) for tractability reasons alone, even though our real
interest is in P_e(Z_w).
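The tractability gap can be made concrete with a toy example (the one-dimensional data and linear discriminant below are invented for illustration, not taken from the text): the empirical error P_e of a thresholded output is piecewise constant in the parameter w, offering no usable gradient, whereas a continuous risk R_L such as the mean squared error varies smoothly with w.

```python
# Toy illustration (assumed setup): a 1-D linear discriminant y_w(x) = w*x,
# thresholded at 0 to produce the classifier z_w(x) = sign(w*x).

def p_e(w, data):
    """Empirical error: misclassification rate of sign(w*x) vs target t in {-1,+1}."""
    errors = sum(1 for x, t in data if (1 if w * x >= 0 else -1) != t)
    return errors / len(data)

def r_l(w, data):
    """Mean squared-error risk of the continuous output y_w(x) = w*x."""
    return sum((w * x - t) ** 2 for x, t in data) / len(data)

data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (3.0, 1)]

# P_e is flat over wide ranges of w (piecewise constant, no gradient)...
assert p_e(0.5, data) == p_e(2.0, data) == 0.0
# ...while R_L varies smoothly in w and so can drive gradient-based training.
assert r_l(0.5, data) != r_l(2.0, data)
```

This is why training minimizes a surrogate risk R_L(Y_w) even though the quantity of real interest is P_e(Z_w).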
Algorithms for the minimization of R_L(Y_w) can be of two sorts: direct
parameter estimation algorithms, and iterative optimization algorithms
(which may be of a stochastic or deterministic nature). As an example of the
first category of algorithms, we have the well-known normal equations for
solving linear regression problems, based on a statistical estimation of the
linear coefficients. The normal equations are directly derived from the mini-
mization of R_L(Y_w) by setting ∂R_L(Y_w)/∂w = 0. This direct approach can
be generalized to more complex regression problems, using kernel tricks (see,
e.g., [207]). A disadvantage of the direct methods is that one may have to op-
erate with large matrices when solving practical problems, and large matrices
raise ill-conditioning and regularization issues which may be difficult to cope
with. Iterative optimization algorithms may then be preferred. These follow
a stepwise adaptive approach, adjusting the parameter vector at each step
in an adequate way that takes into account the first-order (and sometimes
the second-order) derivatives of R_L(Y_w). Examples of such classifier training
(or learning) algorithms are all the gradient descent procedures (such as the
well-known Widrow-Hoff adaptive method and the back-propagation train-
ing of multilayer perceptrons), as well as the popular conjugate gradient and
Levenberg-Marquardt algorithms.
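A minimal sketch of the two families of algorithms, using invented one-dimensional data: the direct normal-equations solution obtained by setting ∂R_L(Y_w)/∂w = 0, and a batch gradient-descent iteration (in the Widrow-Hoff spirit) that converges to the same minimizer.

```python
# Assumed toy data (x, t) for fitting y_w(x) = w*x by squared-error risk.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]

# Direct method: for y_w(x) = w*x, setting dR_L/dw = 0 yields the scalar
# normal equation  w = sum(x*t) / sum(x*x).
w_direct = sum(x * t for x, t in data) / sum(x * x for x, _ in data)

# Iterative method: batch gradient descent on R_L with a fixed step size.
w = 0.0
lr = 0.05
for _ in range(500):
    grad = 2.0 * sum((w * x - t) * x for x, t in data) / len(data)
    w -= lr * grad

# Both approaches reach the same minimizer of R_L(Y_w).
assert abs(w - w_direct) < 1e-6
```

In the multivariate case the same derivation produces a linear system X'Xw = X't, where the ill-conditioning of X'X is exactly the difficulty with large matrices mentioned above.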
For a given classification problem, does a particular choice of loss function,
and therefore of min R_L(Y_w), provide the min P_e(Z_w) solution? We call
this issue the classifier problem: the choice of L leading to a classifier z_w+(·)
achieving the same performance as z_w(·) (note that there may be several
such classifiers). We shall see in the following chapter that loss functions do
not always perform equally in what concerns this issue, which also motivates
the Minimum Error Entropy approach for classification that we introduce.
1.3 Outline of the Topic
The Minimum Error Entropy (MEE) concept is introduced in the following
Chap. 2, where we start by presenting classic risk functionals and discuss