node, until meeting some stop criterion at a tree leaf. For a given tree archi-
tecture, the final solution is, as a consequence of the greedy approach, never
guaranteed to be close to z_w. Moreover, it turns out that the empirical error
is usually a poor choice of criterion for local decisions; instead, some other
type of risk criterion is used.
When carrying out risk optimization in regression-like machines, such as
multilayer perceptrons, one must not lose sight of the fact that the fun-
damental reason for attempting to optimize the risk expressed by formula
(1.7), instead of the special case of risk expressed by formula (1.2), is the
tractability of the former (efficient optimization algorithms for continuous
and differentiable loss functions) and the intractability of the latter. That
is, we consider R_L(Y_w) for tractability reasons alone, even though our real
interest is in P_e(Z_w).
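The tractability gap can be made concrete with a toy example (the one-dimensional data and linear discriminant below are invented for illustration, not taken from the text): the empirical error P_e of a thresholded output is piecewise constant in the parameter w, offering no usable gradient, whereas a continuous risk R_L such as the mean squared error varies smoothly with w.

```python
# Toy illustration (assumed setup): a 1-D linear discriminant y_w(x) = w*x,
# thresholded at 0 to produce the classifier z_w(x) = sign(w*x).

def p_e(w, data):
    """Empirical error: misclassification rate of sign(w*x) vs target t in {-1,+1}."""
    errors = sum(1 for x, t in data if (1 if w * x >= 0 else -1) != t)
    return errors / len(data)

def r_l(w, data):
    """Mean squared-error risk of the continuous output y_w(x) = w*x."""
    return sum((w * x - t) ** 2 for x, t in data) / len(data)

data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (3.0, 1)]

# P_e is flat over wide ranges of w (piecewise constant, no gradient)...
assert p_e(0.5, data) == p_e(2.0, data) == 0.0
# ...while R_L varies smoothly in w and so can drive gradient-based training.
assert r_l(0.5, data) != r_l(2.0, data)
```

This is why training minimizes a surrogate risk R_L(Y_w) even though the quantity of real interest is P_e(Z_w).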
Algorithms for the minimization of R_L(Y_w) can be of two sorts: direct
parameter estimation algorithms, and iterative optimization algorithms
(which may be of a stochastic or deterministic nature). As an example of the
first category of algorithms, we have the well-known normal equations for
solving linear regression problems, based on a statistical estimation of the
linear coefficients. The normal equations are directly derived from the mini-
mization of R_L(Y_w) by setting ∂R_L(Y_w)/∂w = 0. This direct approach can
be generalized to more complex regression problems, using kernel tricks (see,
e.g., [207]). A disadvantage of the direct methods is that one may have to op-
erate with large matrices when solving practical problems, and large matrices
raise ill-conditioning and regularization issues which may be difficult to cope
with. Iterative optimization algorithms may then be preferred. These follow
a stepwise adaptive approach, adjusting the parameter vector at each step
in an adequate way that takes into account the first-order (and sometimes
the second-order) derivatives of R_L(Y_w). Examples of such classifier training
(or learning) algorithms are all the gradient descent procedures (such as the
well-known Widrow-Hoff adaptive method and the back-propagation train-
ing of multilayer perceptrons), as well as the popular conjugate gradient and
Levenberg-Marquardt algorithms.
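A minimal sketch of the two families of algorithms, using invented one-dimensional data: the direct normal-equations solution obtained by setting ∂R_L(Y_w)/∂w = 0, and a batch gradient-descent iteration (in the Widrow-Hoff spirit) that converges to the same minimizer.

```python
# Assumed toy data (x, t) for fitting y_w(x) = w*x by squared-error risk.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]

# Direct method: for y_w(x) = w*x, setting dR_L/dw = 0 yields the scalar
# normal equation  w = sum(x*t) / sum(x*x).
w_direct = sum(x * t for x, t in data) / sum(x * x for x, _ in data)

# Iterative method: batch gradient descent on R_L with a fixed step size.
w = 0.0
lr = 0.05
for _ in range(500):
    grad = 2.0 * sum((w * x - t) * x for x, t in data) / len(data)
    w -= lr * grad

# Both approaches reach the same minimizer of R_L(Y_w).
assert abs(w - w_direct) < 1e-6
```

In the multivariate case the same derivation produces a linear system X'Xw = X't, where the ill-conditioning of X'X is exactly the difficulty with large matrices mentioned above.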
For a given classification problem, does a particular choice of loss function,
and therefore of min R_L(Y_w), provide the min P_e(Z_w) solution? We call
this issue the classifier problem: the choice of L leading to a classifier z_w+(·)
achieving the same performance as z_w(·) (note that there may be several
such classifiers). We shall see in the following chapter that loss functions do
not always perform equally in what concerns this issue, which also motivates
the Minimum Error Entropy approach for classification that we introduce.
1.3 Outline of the Topic
The Minimum Error Entropy (MEE) concept is introduced in the following
Chap. 2, where we start by presenting classic risk functionals and discuss