general formulation one has a learning machine whose algorithm attempts to minimize an approximation performance measure, with distinct flavors according to the problem type (see, e.g., [41]). In the present topic our sole concern is with classification problems.
Regarding the approximation performance of classifiers, several important
points are to be noted:
1. We usually don't know $F_{X,T}$, which renders the integration in formula (1.2) impossible, even in the improbable scenario that the other technicalities were easily solvable. In practical real-world problems, where one designs $z_w$ using the $n$-sized dataset $(X_{ds}, T_{ds})$, one can only obtain an estimate of $P_e$, the so-called empirical error estimate (or empirical error rate), which is simply the average number of misclassified instances (a computational sketch follows this list):
$$\hat{P}_e(n) = \frac{1}{n} \sum_{x \in X_{ds}} I_{\{t \neq z_w(x)\}} . \qquad (1.4)$$
2. The question then arises of how large the training set size, $n$, should be such that with a given confidence (say, 95% confidence) the empirical error rate (the observed error), $\hat{P}_e$, does not deviate from the probability of error (the true error), $P_e$, by more than a fixed tolerance. The answer to this question is not difficult for model-based classifiers: just use the available techniques of statistical estimation, namely the computation of confidence intervals (see the sketch following this list).
For data-based classifiers this generalization issue is rather more intricate. In both cases the analysis leads one to use a "functional complexity" of $Z_W$ adequate to the training set size. Intuitively, too much complexity of $Z_W$ will tend to produce complex decision surfaces that tightly discriminate the training set instances but will be unable to perform equally well (generalize) on new independent datasets (the over-fitting phenomenon); too little complexity will produce a classifier that generalizes well but with poor performance, failing to extract from the data all possible information (under-fitting). The present topic doesn't include a discussion of the generalization issue and related issues such as dimensionality reduction and function regularization. Theoretical and practical aspects of these topics are discussed in detail in [52,230,41,227,11,228]. For shorter but well-structured and informative surveys, see [30,233]. We will include a few remarks on these aspects when appropriate. We will also make use of the simple generalization measure $|\hat{P}_e - P_e|$, which is small if the classifier generalizes well (illustrated after this list). Note that generalization has nothing to do with $\hat{P}_e$ being small; it really means that $\hat{P}_e$ is a good estimate of $P_e$.
3. For a classifier architecture that implements $Z_W = \{ z_w : X \to T;\ w \in W \}$, what is really of interest to us is not some $z_w$ but the best of them all. In other words, one requires the learning algorithm to achieve the minimum probability of error in $Z_W$, expressed as follows:

   Find $w$ s.t. $P_e(z_w)$ is minimum.
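As a concrete aside on points 1 and 2, the following is a minimal sketch, in Python, of the empirical error estimate (1.4) together with a normal-approximation confidence interval for the true error $P_e$; the threshold classifier, the generated data, and all helper names are hypothetical illustrations, not constructs from this text.

    import numpy as np

    def empirical_error(z_w, X_ds, T_ds):
        # Empirical error rate (1.4): fraction of misclassified instances.
        predictions = np.array([z_w(x) for x in X_ds])
        return float(np.mean(predictions != np.asarray(T_ds)))

    def error_confidence_interval(p_hat, n, z=1.96):
        # Normal-approximation 95% confidence interval for the true error
        # P_e, treating the n classifications as Bernoulli trials (point 2).
        half_width = z * np.sqrt(p_hat * (1.0 - p_hat) / n)
        return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

    # Hypothetical example: a one-dimensional threshold classifier z_w.
    rng = np.random.default_rng(0)
    X_ds = rng.normal(size=200)                # inputs
    T_ds = (X_ds > 0).astype(int)              # targets
    z_w = lambda x: int(x > 0.1)               # classifier with parameter w = 0.1
    p_hat = empirical_error(z_w, X_ds, T_ds)
    low, high = error_confidence_interval(p_hat, len(X_ds))
    print(f"empirical error = {p_hat:.3f}, 95% CI for P_e = [{low:.3f}, {high:.3f}]")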
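The generalization measure $|\hat{P}_e - P_e|$ can itself be illustrated by comparing the design-set error with the error on a much larger independent sample, which stands in for the unavailable $P_e$; again a hedged sketch with entirely hypothetical data and names.

    import numpy as np

    def error_on(z_w, X, T):
        # Fraction of instances of (X, T) misclassified by z_w.
        return float(np.mean(np.array([z_w(x) for x in X]) != np.asarray(T)))

    rng = np.random.default_rng(2)
    z_w = lambda x: int(x > 0.1)               # same hypothetical classifier

    X_ds = rng.normal(size=50)                 # small n-sized design set
    T_ds = (X_ds > 0).astype(int)
    X_ind = rng.normal(size=100_000)           # large independent sample, whose
    T_ind = (X_ind > 0).astype(int)            # error rate approximates P_e

    gap = abs(error_on(z_w, X_ds, T_ds) - error_on(z_w, X_ind, T_ind))
    print(f"|Pe_hat - Pe| ~= {gap:.3f}")       # small gap: good generalization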
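Finally, the search in point 3 can be sketched for a hypothetical one-parameter family $Z_W$ of threshold classifiers: since $P_e$ itself is unavailable, the empirical estimate (1.4) stands in for it when picking the best $w$.

    import numpy as np

    def best_parameter(W_grid, X_ds, T_ds):
        # Return the w in the hypothetical family Z_W = {z_w(x) = [x > w]}
        # whose classifier attains the minimum empirical error on (X_ds, T_ds).
        errors = [float(np.mean((X_ds > w).astype(int) != T_ds)) for w in W_grid]
        return float(W_grid[int(np.argmin(errors))])

    rng = np.random.default_rng(1)
    X_ds = rng.normal(size=200)
    T_ds = (X_ds > 0).astype(int)
    w_best = best_parameter(np.linspace(-1.0, 1.0, 41), X_ds, T_ds)
    print(f"selected w = {w_best:.3f}")        # lands near the true threshold 0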