that case, usually the class is decided through a vote, based on the value of
the potential of the output neuron. The underlying rationale, called winner
takes all (WTA), is that the larger the potential on the output neuron, the
more confident we are in its classification.
We will show below that the probabilistic interpretation of the classifica-
tion is based on the distance of the examples to the discriminant surfaces, that
is, the absolute value of the potential divided by the norm of the weight vec-
tor. Therefore, our confidence in a classification should be based on distances
and not on bare potentials, unless the weights are normalized. But a deeper
problem posed by the WTA procedure is the following: the output unit only
reflects the properties of the internal representations. Our confidence should
depend on the distances of the input vector to the discriminant surfaces in
input space, which are proportional to the potentials of the hidden neurons.
It may happen that the input pattern lies so close to one discriminant surface
in input space that its class is uncertain. However, its internal representation
may have a large stability (see Fig. 6.18), and thus win the WTA vote
against the other classifiers.
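As a purely illustrative sketch (the weights, the input and the numpy code below are ours, not part of the text), the following lines contrast the two decision rules on a two-class toy example: WTA applied to the raw potentials w · x + b, and WTA applied to the distances |w · x + b| / ||w|| to the discriminant surfaces. Because the first unit has a much larger weight vector, the two rules disagree here: the raw potential favors class 0, whereas the distance criterion favors class 1.

    import numpy as np

    # Two linear output units (one discriminant surface w.x + b = 0 per class);
    # the weights of the first unit are deliberately left un-normalized.
    W = np.array([[4.0, 0.0],
                  [0.3, 0.4]])
    b = np.array([-0.2, 0.05])
    x = np.array([0.1, 0.2])          # input pattern

    potentials = W @ x + b                                       # raw potentials
    distances = np.abs(potentials) / np.linalg.norm(W, axis=1)   # |v| / ||w||

    print("WTA on potentials:", int(np.argmax(potentials)))      # picks class 0
    print("WTA on distances :", int(np.argmax(distances)))       # picks class 1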
Another way of dealing with the problem of multiple classes is to construct
trees of neural networks. To this end, we choose a sequence of classes in an
arbitrary order, for example {K, 2, ..., 1}, and we learn the discrimination
between the first class and the K − 1 others. In our example, we may define
targets y = +1 for the examples of the first class (that is, class K), and
y = −1 for the others. Then, we restrict the training set to the patterns of
the classes not yet discriminated ({2, ..., 1} in our example), and we learn the separation
of class 2 from the others. The procedure is repeated until the two remaining
classes are separated. One appealing feature of this heuristic is that the successive
training sets have decreasing sizes. The resulting network has a tree structure.
In order to classify a new input, it is first presented to the first network.
If the output is σ = +1, the class is K. Otherwise (σ = −1) the pattern is
presented as input to the second network. The procedure stops as soon as
one network recognizes (output σ = +1) the pattern.
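The following sketch (an assumption of ours, not taken from the text: it uses a plain perceptron rule and illustrative names such as train_linear_classifier, train_tree and classify_with_tree) makes the tree procedure explicit for binary classifiers with outputs in {−1, +1}.

    import numpy as np

    def train_linear_classifier(X, y, epochs=100, lr=0.1):
        # Plain perceptron rule on targets y in {-1, +1}; any trainable
        # two-class classifier could be used instead.
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (xi @ w + b) <= 0:   # misclassified (or on the surface)
                    w = w + lr * yi * xi
                    b = b + lr * yi
        return w, b

    def train_tree(X, labels, class_sequence):
        # One classifier per class of the sequence except the last; after each
        # step the training set is restricted to the classes not yet separated.
        classifiers = []
        for c in class_sequence[:-1]:
            y = np.where(labels == c, 1.0, -1.0)
            classifiers.append((c, train_linear_classifier(X, y)))
            keep = labels != c
            X, labels = X[keep], labels[keep]
        return classifiers, class_sequence[-1]

    def classify_with_tree(x, classifiers, last_class):
        # Present the pattern to the networks in order; stop at the first
        # network whose output is sigma = +1.
        for c, (w, b) in classifiers:
            if x @ w + b > 0:
                return c
            # otherwise (sigma = -1): pass the pattern to the next network
        return last_class

With class_sequence chosen as {K, 2, ..., 1} above, the first classifier separates class K from all the others, and a pattern receives the last class of the sequence only if every network answers σ = −1.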
Since the sequence of classes selected at the beginning is arbitrary, in
principle one should compare the outputs of different trees, each tree
corresponding to a different sequence
of classes. However, if the number of classes is large (typically for K > 4),
this method becomes impractical. Another solution was proposed in the section
“methodology” of Chap. 1: if the classes are not mutually linearly separable, one
may resort to pairwise separation. For a problem with K classes, this requires
the construction of K(K − 1)/2 classifiers, which in many practical applications
turn out to be linear. Since there is no arbitrary sequence chosen a priori, there
is no need to compare the outputs of the K! possible trees. One advantage of this
solution is that one can use different descriptors for the different separations,
which may simplify the problem. We have shown in Chap. 1 how to estimate
the probability that a given pattern belongs to each of the possible classes,
based on the results obtained in the pairwise separations.
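As a rough sketch of the pairwise strategy (again an illustration, not the method of Chap. 1: a least-squares fit stands in for the linear separations, and a simple majority vote replaces the probability estimation mentioned above), one may proceed as follows.

    import numpy as np
    from itertools import combinations

    def train_pairwise(X, labels, classes):
        # One linear separator per unordered pair of classes: K(K-1)/2 in total.
        # A least-squares fit on +/-1 targets stands in for any linear method.
        classifiers = {}
        for a, b in combinations(classes, 2):
            mask = (labels == a) | (labels == b)
            Xab = np.hstack([X[mask], np.ones((int(mask.sum()), 1))])  # bias column
            y = np.where(labels[mask] == a, 1.0, -1.0)
            w, *_ = np.linalg.lstsq(Xab, y, rcond=None)
            classifiers[(a, b)] = w
        return classifiers

    def classify_pairwise(x, classifiers):
        # Majority vote over the pairwise decisions (a crude substitute for the
        # probability estimates of Chap. 1).
        votes = {}
        xb = np.append(x, 1.0)
        for (a, b), w in classifiers.items():
            winner = a if xb @ w > 0 else b
            votes[winner] = votes.get(winner, 0) + 1
        return max(votes, key=votes.get)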