parameter space $W$. However, except for very simple problems, such a search is impractical; specifically, for both decision trees and multilayer perceptrons it is known to be NP-complete (see, respectively, [108] and [29]; although the proof in the latter concerns 3-node neural networks, its extension to more complex architectures is straightforward). (For the reader unfamiliar with computational complexity theory, we may informally say that "NP-complete" refers to a class of optimization problems for which no efficient way of finding a solution is known.)
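A rough count makes the impracticality of exhaustive search concrete. The sketch below (a purely illustrative back-of-the-envelope computation; the network size and the number of discretization levels per weight are hypothetical choices, not taken from the text) counts the candidate configurations obtained by discretizing every weight of a small multilayer perceptron:

```python
# Back-of-the-envelope count of an exhaustive search over a discretized
# weight space W for a small multilayer perceptron (hypothetical sizes).
n_inputs, n_hidden, n_outputs = 10, 5, 1

# Weights plus biases of a single-hidden-layer perceptron.
n_weights = n_inputs * n_hidden + n_hidden + n_hidden * n_outputs + n_outputs

levels = 10  # hypothetical number of candidate values per weight
n_configurations = levels ** n_weights

print(f"{n_weights} weights, {levels} levels each -> "
      f"{float(n_configurations):.2e} configurations")
# Around 1e61 configurations: far beyond any feasible exhaustive evaluation.
```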
For regression-like machines such as neural networks, the practical approach to finding a solution close to $z_w$ is to employ some sort of optimization algorithm (e.g., gradient descent). Optimization algorithms, however, impose a requirement that precludes using the expectation expressed by formula (1.2): the loss function must be continuous and differentiable. One then needs to work with the continuous output, $Y_w$, of these machines and, instead of the discontinuous loss function in expression (1.2), use some suitable continuous and differentiable loss function; i.e., instead of attempting to find the minimum probability of error, one searches for a solution minimizing the risk
$$
R_L(Y_w) = \int_{X \times T} L\bigl(t(x), y_w(x)\bigr)\, dF_{X,T}(x,t), \qquad (1.7)
$$

which is an expected value of some continuous and differentiable loss function $L(t(x), y_w(x))$, $y_w \in Y_W$: $R_L(Y_w) = E_{X,T}[L(t(X), y_w(X))]$, where $y_w(\cdot)$ is now the classifier's continuous output (preceding the thresholding operation). The risk, also known in the literature as the error criterion, expresses an average penalization of the deviations of $y_w(\cdot)$ from the target $t(\cdot)$, in accordance with the loss function. For a given machine and data distribution the risk therefore depends on the loss function being used.

Similarly to what happened with the probability of error, we have to deal in practice with an empirical risk, computed as

$$
\hat{R}_L(Y_w) = \frac{1}{n} \sum_{x \in X_{ds}} L\bigl(t(x), y_w(x)\bigr). \qquad (1.8)
$$
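To illustrate how the empirical risk (1.8) is used in practice, the sketch below evaluates it for a simple linear machine with a squared-error loss (one possible continuous and differentiable choice of $L$) and minimizes it by plain gradient descent; the data set, the machine, and the loss are assumptions made only for this example:

```python
import numpy as np

def y_w(X, w):
    """Continuous output of a simple linear machine (before thresholding)."""
    return np.tanh(X @ w[:-1] + w[-1])

def squared_error(t, y):
    """One possible continuous and differentiable loss L(t, y)."""
    return (t - y) ** 2

def empirical_risk(w, X, t, loss=squared_error):
    """Empirical risk of formula (1.8): average loss over the data set."""
    return np.mean(loss(t, y_w(X, w)))

# Illustrative two-class data set with targets t in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
t = np.concatenate([-np.ones(50), np.ones(50)])

# Plain gradient descent on the empirical risk; the gradient is estimated by
# central differences here to keep the sketch short (analytic gradients, as
# in back-propagation, are what one would use in practice).
w, eta, eps = np.zeros(3), 0.1, 1e-6
for _ in range(200):
    grad = np.array([
        (empirical_risk(w + eps * e, X, t) - empirical_risk(w - eps * e, X, t)) / (2 * eps)
        for e in np.eye(len(w))
    ])
    w = w - eta * grad

z = np.where(y_w(X, w) >= 0, 1, -1)  # thresholded classifier z_w = theta(y_w)
print("empirical risk:", empirical_risk(w, X, t))
print("training error rate:", np.mean(z != t))
```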
Let us assume that, based on the minimization of the empirical risk, we somehow succeeded in obtaining a good estimate of the minimum risk (not min $P_e$) solution; that is, we succeeded in obtaining a good estimate of

$$
R_L(Y_{w^+}) \quad \text{with} \quad y_{w^+} = \arg\min_{y_w \in Y_W} R_L(Y_w), \qquad (1.9)
$$

where $y_{w^+}$ denotes the minimum risk classifier; $z_{w^+} = \theta(y_{w^+})$ is not guaranteed to be $z_w$, the min $P_e$ classifier in $Z_W = \theta(Y_W)$.
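The distinction between the minimum risk classifier and the min $P_e$ classifier can be made tangible with a small numerical experiment: scan a set of candidate parameter values and record, for each one, both the empirical risk and the empirical error rate. In the one-parameter sketch below (the data, the loss, and the parameter scan are assumptions of the example, not a training procedure advocated by the text), the value minimizing the risk need not coincide with the value minimizing the error rate:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative 1-D two-class data with targets t in {-1, +1} and unequal spreads.
x = np.concatenate([rng.normal(-1.0, 1.5, 200), rng.normal(1.0, 0.5, 200)])
t = np.concatenate([-np.ones(200), np.ones(200)])

def risk(w):
    """Empirical risk with a squared-error loss on the continuous output tanh(x - w)."""
    return np.mean((t - np.tanh(x - w)) ** 2)

def error_rate(w):
    """Empirical probability of error of the thresholded classifier sign(x - w)."""
    return np.mean(np.where(x - w >= 0, 1, -1) != t)

candidates = np.linspace(-2.0, 2.0, 401)  # candidate split points (the single "weight")
w_plus = candidates[np.argmin([risk(w) for w in candidates])]
w_star = candidates[np.argmin([error_rate(w) for w in candidates])]

print("minimum-risk  w+:", w_plus, " error rate:", error_rate(w_plus))
print("minimum-error w*:", w_star, " error rate:", error_rate(w_star))
# The two minimizers generally differ: minimizing R_L does not by itself
# guarantee the minimum probability of error.
```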
For decision trees, some kind of greedy algorithm is used, which constructs the tree as a top-down sequence of locally optimized decisions at each tree