parameter space $W$. However, except for very simple problems, such a search is impractical; specifically, for both decision trees and multilayer perceptrons it is known to be NP-complete (see, respectively, [108] and [29]; although the proof in the latter concerns 3-node neural networks, its extension to more complex architectures is straightforward). (For the reader unfamiliar with computational complexity theory, we may informally say that "NP-complete" refers to a class of optimization problems for which no efficient way of finding a solution is known.)
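A rough count makes the impracticality of exhaustive search concrete. The sketch below (a purely illustrative back-of-the-envelope computation; the network size and the number of discretization levels per weight are hypothetical choices, not taken from the text) counts the candidate configurations obtained by discretizing every weight of a small multilayer perceptron:

```python
# Back-of-the-envelope count of an exhaustive search over a discretized
# weight space W for a small multilayer perceptron (hypothetical sizes).
n_inputs, n_hidden, n_outputs = 10, 5, 1

# Weights plus biases of a single-hidden-layer perceptron.
n_weights = n_inputs * n_hidden + n_hidden + n_hidden * n_outputs + n_outputs

levels = 10  # hypothetical number of candidate values per weight
n_configurations = levels ** n_weights

print(f"{n_weights} weights, {levels} levels each -> "
      f"{float(n_configurations):.2e} configurations")
# Around 1e61 configurations: far beyond any feasible exhaustive evaluation.
```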
For regression-like machines such as neural networks, the practical approach to finding a solution close to $z_w$ is to employ some sort of optimization algorithm (e.g., gradient descent). Optimization algorithms, however, impose a requirement that precludes using the expectation expressed by formula (1.2): the loss function must be continuous and differentiable. One then needs to work with the continuous output, $Y_w$, of these machines and, instead of the discontinuous loss function in expression (1.2), use some suitable continuous and differentiable loss function; i.e., instead of attempting to find the minimum probability of error, one searches for a solution minimizing the risk
$$
R_L(Y_w) = \int_{X \times T} L\bigl(t(x), y_w(x)\bigr)\, dF_{X,T}(x,t), \qquad (1.7)
$$

which is an expected value of some continuous and differentiable loss function $L(t(x), y_w(x))$, $y_w \in Y_W$: $R_L(Y_w) = E_{X,T}[L(t(X), y_w(X))]$, where $y_w(\cdot)$ is now the classifier's continuous output (preceding the thresholding operation). The risk, also known in the literature as the error criterion, expresses an average penalization of the deviations of $y_w(\cdot)$ from the target $t(\cdot)$, in accordance with the loss function. For a given machine and data distribution the risk therefore depends on the loss function being used.

Similarly to what happened with the probability of error, we have to deal in practice with an empirical risk, computed as

$$
\hat{R}_L(Y_w) = \frac{1}{n} \sum_{x \in X_{ds}} L\bigl(t(x), y_w(x)\bigr). \qquad (1.8)
$$
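To illustrate how the empirical risk (1.8) is used in practice, the sketch below evaluates it for a simple linear machine with a squared-error loss (one possible continuous and differentiable choice of $L$) and minimizes it by plain gradient descent; the data set, the machine, and the loss are assumptions made only for this example:

```python
import numpy as np

def y_w(X, w):
    """Continuous output of a simple linear machine (before thresholding)."""
    return np.tanh(X @ w[:-1] + w[-1])

def squared_error(t, y):
    """One possible continuous and differentiable loss L(t, y)."""
    return (t - y) ** 2

def empirical_risk(w, X, t, loss=squared_error):
    """Empirical risk of formula (1.8): average loss over the data set."""
    return np.mean(loss(t, y_w(X, w)))

# Illustrative two-class data set with targets t in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
t = np.concatenate([-np.ones(50), np.ones(50)])

# Plain gradient descent on the empirical risk; the gradient is estimated by
# central differences here to keep the sketch short (analytic gradients, as
# in back-propagation, are what one would use in practice).
w, eta, eps = np.zeros(3), 0.1, 1e-6
for _ in range(200):
    grad = np.array([
        (empirical_risk(w + eps * e, X, t) - empirical_risk(w - eps * e, X, t)) / (2 * eps)
        for e in np.eye(len(w))
    ])
    w = w - eta * grad

z = np.where(y_w(X, w) >= 0, 1, -1)  # thresholded classifier z_w = theta(y_w)
print("empirical risk:", empirical_risk(w, X, t))
print("training error rate:", np.mean(z != t))
```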
Let us assume that, based on the minimization of the empirical risk, we somehow succeeded in obtaining a good estimate of the minimum risk (not min $P_e$) solution; that is, we succeeded in obtaining a good estimate of

$$
R_L(Y_{w^+}) \quad \text{with} \quad y_{w^+} = \arg\min_{y_w \in Y_W} R_L(Y_w), \qquad (1.9)
$$

where $y_{w^+}$ denotes the minimum risk classifier; $z_{w^+} = \theta(y_{w^+})$ is not guaranteed to be $z_w$, the min $P_e$ classifier in $Z_W = \theta(Y_W)$.
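The distinction between the minimum risk classifier and the min $P_e$ classifier can be made tangible with a small numerical experiment: scan a set of candidate parameter values and record, for each one, both the empirical risk and the empirical error rate. In the one-parameter sketch below (the data, the loss, and the parameter scan are assumptions of the example, not a training procedure advocated by the text), the value minimizing the risk need not coincide with the value minimizing the error rate:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative 1-D two-class data with targets t in {-1, +1} and unequal spreads.
x = np.concatenate([rng.normal(-1.0, 1.5, 200), rng.normal(1.0, 0.5, 200)])
t = np.concatenate([-np.ones(200), np.ones(200)])

def risk(w):
    """Empirical risk with a squared-error loss on the continuous output tanh(x - w)."""
    return np.mean((t - np.tanh(x - w)) ** 2)

def error_rate(w):
    """Empirical probability of error of the thresholded classifier sign(x - w)."""
    return np.mean(np.where(x - w >= 0, 1, -1) != t)

candidates = np.linspace(-2.0, 2.0, 401)  # candidate split points (the single "weight")
w_plus = candidates[np.argmin([risk(w) for w in candidates])]
w_star = candidates[np.argmin([error_rate(w) for w in candidates])]

print("minimum-risk  w+:", w_plus, " error rate:", error_rate(w_plus))
print("minimum-error w*:", w_star, " error rate:", error_rate(w_star))
# The two minimizers generally differ: minimizing R_L does not by itself
# guarantee the minimum probability of error.
```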
For decision trees, some kind of greedy algorithm is used, which constructs the tree as a top-down sequence of locally optimized decisions at each tree