The best classifier is

$$z_{w^*} = \arg\min_{z_w \in Z_W} P_e(Z_w). \qquad (1.5)$$
The classifier $z_{w^*}: X \to T$, with optimal parameter $w^*$, is the best one (in the minimum $P_e$ sense) in the family $Z_W$. We will often denote $P_e(Z_{w^*})$ simply as $\min P_e$, signifying $\min_{Z_W} P_e(Z_w)$, the minimum probability of error for the functional family allowed by the classifier architecture.
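As an illustration of the minimization in (1.5), the following sketch searches a small family of threshold classifiers for the one with the lowest error rate on a synthetic two-class problem. The Gaussian class-conditionals, the grid over $w$, and all names are assumptions made for this example, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-class 1-D problem (distributions are assumptions):
# class 0 ~ N(-1, 1), class 1 ~ N(+1, 1), equal priors.
n = 100_000
t = rng.integers(0, 2, size=n)            # targets t in {0, 1}
x = rng.normal(2.0 * t - 1.0, 1.0)        # inputs

# Family Z_W: threshold classifiers z_w(x) = 1[x > w], w on a grid.
W = np.linspace(-3.0, 3.0, 121)

def empirical_pe(w):
    """Estimate P_e(z_w) by the error rate on the sample."""
    return np.mean((x > w).astype(int) != t)

pe = np.array([empirical_pe(w) for w in W])
w_star = W[np.argmin(pe)]                 # the argmin of eq. (1.5) over the grid
print(f"w* ~ {w_star:.2f}, min Pe ~ {pe.min():.3f}")
```

For this symmetric setup the selected threshold lands near 0 and the minimum error near $\Phi(-1) \approx 0.159$, the best achievable within this family.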
An important aspect concerning the estimates $P_e(n)$ produced by a classifier is whether or not they will converge (in some sense) with growing $n$ to $\min P_e$. This consistency issue of the learning algorithm will be addressed when appropriate.
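A toy simulation can make this convergence concrete. The setup below (equal-prior classes with Gaussian class-conditionals and a fixed threshold classifier) is an assumption chosen for illustration; the sample error estimates approach the true error as $n$ grows.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

# Assumed setup: classes 0 and 1 with equal priors, x | t ~ N(2t - 1, 1).
# For the fixed classifier z(x) = 1[x > 0], the true error is
# Phi(-1) ~ 0.1587, where Phi is the standard normal CDF.
true_pe = 0.5 * (1 + erf(-1 / sqrt(2)))

estimates = {}
for n in (100, 10_000, 1_000_000):
    t = rng.integers(0, 2, size=n)
    x = rng.normal(2.0 * t - 1.0, 1.0)
    estimates[n] = float(np.mean((x > 0).astype(int) != t))
    print(n, round(estimates[n], 4))
```

The estimate's standard deviation shrinks as $1/\sqrt{n}$, so the largest sample lies very close to the true error.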
4. If one knew the class priors, $P(t_k)$, and the class conditional distributions of the targets, $p(x|t_k)$, with $p$ representing either a PMF or a PDF, one would then be able to determine the best possible classifier based on the Bayes decision theory: just pick the class that maximizes the posterior probability

$$P(t_k|x) = \frac{p(x|t_k)\,P(t_k)}{p(x)}, \quad \text{with} \quad p(x) = \sum_{k=1}^{c} p(x|t_k)\,P(t_k). \qquad (1.6)$$
This is the procedure followed by the model-based approach to classification. The best classifier, the one maximizing $P(t_k|x)$, is known as the Bayes classifier, $z_{\text{Bayes}}$.
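A minimal sketch of this posterior-maximization rule follows; the priors, Gaussian class-conditionals, and function names are assumptions for the example only.

```python
import numpy as np

# Assumed model: two classes with known priors and Gaussian class-conditionals.
priors = np.array([0.5, 0.5])                  # P(t_k)
means, sigma = np.array([-1.0, 1.0]), 1.0      # x | t_k ~ N(means[k], sigma^2)

def gauss_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def bayes_classify(x):
    """Pick the class maximizing the posterior of eq. (1.6)."""
    likelihoods = gauss_pdf(x, means, sigma)   # p(x | t_k)
    joint = likelihoods * priors               # p(x | t_k) P(t_k)
    posterior = joint / joint.sum()            # divide by p(x) = sum_k joint
    return int(np.argmax(posterior))

print(bayes_classify(-0.3))  # -> 0 (left of the midpoint 0)
print(bayes_classify(0.7))   # -> 1
```

Since $p(x)$ is the same for every class, dividing by it does not change the argmax; it is kept here only to show the full posterior of (1.6).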
One always has $P_e(Z_w) \ge P_e(Z_{w^*}) \ge P_e(Z_{\text{Bayes}})$. Note that there will be function families $Z_B$ such that $z_{w^*}(\cdot) = z_{\text{Bayes}}(\cdot)$ with $w^* \in B$ (e.g., multilayer perceptrons with "enough" hidden neurons are known to have universal functional approximation capabilities); however, one usually will not be sure whether or not $Z_B$ is implementable by the classification system being used (for multilayer perceptrons, "enough" may not be affordable, among other things because of the generalization issue). We will therefore not pursue the task of analyzing the approximation of data-based classifiers to $z_{\text{Bayes}}$.
We shall also not discuss whether $z_w$ converges with $n$ (in some sense) to $z_{\text{Bayes}}$, the so-called Bayes-consistency issue, which depends largely on the classification system being used; as a matter of fact, the lack of Bayes-consistency does not preclude the usefulness of a classification system (binary decision trees with impurity decision rules are an example of that). For details on the consistency of classification systems the reader may find it useful to consult [52] and [11].
Let us now address the problem of how to find the best classifier $z_{w^*}$, affordable by the function family $Z_W$ implemented by the classification system. One could consider using formula (1.4) (with large $n$ so that $P_e(n)$ is close to $P_e(Z_{w^*})$) and perform an exhaustive search in some discrete version of the