(A coding scheme such as $T = \{1, \ldots, c\}$ turns out to be uncomfortable to deal with.) For instance, for $c = 4$ the classifier would have four outputs, $\{z_1, z_2, z_3, z_4\}$, and class $\omega_2$, say, would be represented by $[0\ 1\ 0\ 0]^T$ or $[-1\ 1\ -1\ -1]^T$; i.e., with $z_2$ discriminating $\omega_2$ from its complement. We then see that the 1-of-$c$ coding scheme allows the classification task to be analyzed as a set of dichotomizations, whereby each class is discriminated from its complement. It is not surprising, therefore, that in the following chapters we concentrate on two-class classification problems; the results carry over directly to multi-class problems.
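To make the coding concrete, here is a minimal NumPy sketch (the helper name `one_of_c` is ours, purely illustrative) that produces the 1-of-$c$ targets and the $\{-1, 1\}$ variant:

```python
import numpy as np

def one_of_c(labels, c):
    """Encode integer class labels 1..c as 1-of-c (one-hot) target vectors."""
    t = np.zeros((len(labels), c))
    t[np.arange(len(labels)), np.asarray(labels) - 1] = 1.0
    return t

# Class omega_2 with c = 4 is coded as [0 1 0 0]; output z_2
# dichotomizes omega_2 from its complement.
t = one_of_c([2], c=4)
print(t)          # [[0. 1. 0. 0.]]
print(2 * t - 1)  # [[-1.  1. -1. -1.]] -- the {-1, 1} variant
```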
Classifiers such as several types of neural networks (multilayer perceptrons, for example) are regression-like machines: the classifier builds a continuous approximation to the targets, using some function $y_w$, which is subsequently thresholded in order to yield numerical values according to the coding scheme. For instance, when neural network outputs are produced by sigmoid units in the $[0, 1]$ range, one may set a threshold at 0.5; an output larger than 0.5 is assigned the class label corresponding to $T = 1$ (say, $\omega_2$), otherwise to its complement ($\bar{\omega}_2$), corresponding to $T = 0$. The function
$y_w$ can be interpreted in a geometrical sense, with its $X$-$y_w(X)$ graph representing a decision surface that separates (discriminates) instances from distinct classes. The classifier mapping is $z_w : X \to T$, with $z_w = \theta_w(y_w)$ and $\theta_w$ an appropriate thresholding function.
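The sketch below illustrates this composition under the simplest assumptions: a single sigmoid unit with hypothetical weights `w` and bias `b` (both ours, for illustration), thresholded at 0.5 to yield the $T = \{0, 1\}$ coding:

```python
import numpy as np

def y_w(x, w, b):
    """Continuous, regression-like output: a sigmoid unit with range [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def theta(y, threshold=0.5):
    """Thresholding function mapping the continuous output onto T = {0, 1}."""
    return (y > threshold).astype(int)

def z_w(x, w, b):
    """Classifier mapping z_w : X -> T, the composition theta(y_w(x))."""
    return theta(y_w(x, w, b))

# Hypothetical weights: points with x1 + x2 > 1 get T = 1 (omega_2).
x = np.array([[0.2, 0.1], [0.9, 0.8]])
w, b = np.array([1.0, 1.0]), -1.0
print(y_w(x, w, b))  # continuous outputs, approx. [0.33 0.67]
print(z_w(x, w, b))  # [0 1] -- second point assigned to omega_2
```

In this toy setting the zero level set of $x_1 + x_2 - 1$ plays the role of the decision surface mentioned above.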
Neural networks and decision trees are used extensively in this book to illustrate the minimum error entropy approach to supervised classification. There is an abundant literature on theoretical issues, as well as on architectures and design methods, of neural networks and decision trees. Relevant references on neural networks are [26, 27, 142, 96]; for decision trees, see [33] and [177]. Supervised and unsupervised classifiers, especially with a focus on Pattern Recognition applications, are presented in detail in [59] and [225]. A main reference on theoretical issues concerning classifier operation and performance is [52].
1.2 Data-Based Learning
For a classifier whose design is based on a set $(X_{ds}, T_{ds}) = \{(x_i, t(x_i));\ i = 1, 2, \ldots, n\}$, some type of approximation of the output $z_w(x)$ to $t(x)$, for all $x \in X$, is required. Of all possible approximation criteria one could think of, one clearly stands out, addressing the mismatch between observed and predicted class labels: how close does the classifier come to achieving the minimum probability of (classification) error, $P_e$, in $X \times T$? We are then interested in the efficacy of the classifier in the population, $X \times T$.
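Since only the design set is available in practice, the empirical error rate on $(X_{ds}, T_{ds})$ is the natural surrogate for $P_e$. A minimal sketch (the helper name `empirical_error` is ours):

```python
import numpy as np

def empirical_error(z, t):
    """Design-set error rate: fraction of instances with z_w(x_i) != t(x_i).

    Only an estimate of the true P_e, which refers to the whole
    population X x T rather than to the design set (X_ds, T_ds).
    """
    return float(np.mean(np.asarray(z) != np.asarray(t)))

# Hypothetical predicted labels z and targets t for n = 5 instances:
t = [0, 1, 1, 0, 1]
z = [0, 1, 0, 0, 1]
print(empirical_error(z, t))  # 0.2 -- one mismatch out of five
```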
Figure 1.1 illustrates a data-based classification system implementing $z_w \in Z_W$, where the parameter $w$ is tuned by a learning algorithm, built expressly for the design process alone. The learning algorithm performs an