Pairwise Classification
For difficult problems, it is often much safer to split a C-class classification problem into C(C − 1)/2 pairwise classification problems (a short enumeration sketch follows the list below), for the following reasons:
• When performing pairwise classification, the designer can take advantage of many theoretical results and algorithms pertaining to linear class separation; they are fully developed in Chap. 6; we give a cursory introduction to that material in the next section, entitled "Linear Separability."
• The resulting networks are much more compact, with fast training and simple analysis; since each network has a single output, its probabilistic interpretation is trivial.
• The features that are relevant for separating class A from class B are not necessarily identical to the features that are relevant for separating class A from class C; therefore, each classifier has only the inputs that are relevant to its own task, whereas a multilayer Perceptron for global separation must have all input features that are relevant for the discrimination of all classes; the feature selection techniques that are described in Chap. 2 can be used in a very straightforward fashion.
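As a minimal illustration of the combinatorics, the following Python sketch (the labels are hypothetical, not from the text) enumerates the C(C − 1)/2 pairwise problems for C = 4 classes:

```python
# Enumerate all pairwise classification problems for a C-class task.
from itertools import combinations

C = 4
classes = list(range(C))
pairs = list(combinations(classes, 2))
print(len(pairs))  # C * (C - 1) // 2 = 6
print(pairs)       # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```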
Once the C(C − 1)/2 posterior probabilities are estimated, possibly with simple linear separators (neural networks with no hidden neuron), the posterior probability of class C_i for a feature vector x is computed as

$$\Pr(C_i \mid x) = \frac{1}{\displaystyle\sum_{j=1,\, j \neq i}^{C} \frac{1}{\Pr_{ij}} \,-\, (C-2)},$$
where C is the number of classes and Pr_ij is the posterior probability of class i or class j, as estimated by the neural network that separates class C_i from class C_j.
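As a sketch of how this combination rule might be implemented (the function and variable names are illustrative, not from the text), assuming the pairwise estimates are stored in a matrix with P[i, j] = Pr_ij:

```python
import numpy as np

def class_posterior(i, P):
    """Combine pairwise estimates into Pr(C_i | x) using
    Pr(C_i | x) = 1 / (sum_{j != i} 1 / Pr_ij - (C - 2)).
    P[i, j] holds Pr_ij; by construction P[j, i] = 1 - P[i, j]."""
    C = P.shape[0]
    s = sum(1.0 / P[i, j] for j in range(C) if j != i)
    return 1.0 / (s - (C - 2))

# Consistency check: if Pr_ij = p_i / (p_i + p_j) for true posteriors p,
# the combination rule recovers p exactly.
p = np.array([0.5, 0.3, 0.15, 0.05])
C = len(p)
P = np.array([[p[i] / (p[i] + p[j]) if i != j else 0.5
               for j in range(C)] for i in range(C)])
print([round(class_posterior(i, P), 4) for i in range(C)])
# -> [0.5, 0.3, 0.15, 0.05]
```

The check exercises the identity behind the formula: when Pr_ij = p_i/(p_i + p_j), the sum of the reciprocals equals (C − 1) + (1 − p_i)/p_i, so subtracting (C − 2) leaves exactly 1/p_i.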
Linear Separability
Two sets of patterns, described in an n-dimensional feature space, belonging to
two different classes, are said to be “linearly separable” if they lie on different
sides of a hyperplane in feature space.
If two sets of examples are linearly separable, a neural network made of a single neuron (also termed a perceptron) can separate them. Consider a neuron with a sigmoid activation function with n inputs; its output is given by

$$y = \tanh\left(\sum_{i=1}^{n} w_i x_i\right).$$

The simple relation P = (y + 1)/2 provides an interpretation of the output of the classifier as a posterior probability. From Bayes decision rule, the equation of the boundary between the classes is given by P = 0.5, or equivalently y = 0. Therefore, the separating surface is a hyperplane in feature space.
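The following sketch (names and values are illustrative) implements such a single tanh neuron and its probabilistic reading; the decision boundary P = 0.5 is precisely the hyperplane where the weighted sum vanishes:

```python
import numpy as np

def neuron_posterior(x, w):
    """Single neuron with tanh activation: y = tanh(w . x).
    P = (y + 1) / 2 is read as the posterior probability of the
    positive class; P = 0.5 (i.e. y = 0) is the separating
    hyperplane w . x = 0."""
    y = np.tanh(np.dot(w, x))
    return (y + 1.0) / 2.0

w = np.array([1.0, -2.0, 0.5])   # illustrative weights
x = np.array([0.3, 0.1, 0.2])    # illustrative feature vector
print(neuron_posterior(x, w))    # ~0.60: above 0.5, positive class
```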