closest to it according to some distance measure d(x_i, x_j). While the NN classifier
achieves perfect performance on the training set, generalization to a test set might
not be satisfactory. Generalization typically improves if the classification decision is
based not only on the closest training vector, but on the K nearest neighbors. If the
majority rule is used for the decision, the system is called a KNN classifier.
The KNN classifier stores all examples of the training set. While this requires
no training time, it is frequently advantageous to invest some time for learning in
order to extract the essence of a dataset for later use as a data model. This speeds up
the recall and improves generalization.
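As an illustration, the recall procedure of a KNN classifier can be sketched in a few lines of Python. The class and parameter names below (KNNClassifier, k, fit, predict) are chosen for this sketch only, and the distance measure is assumed to be Euclidean; neither appears in the text above.

```python
# Minimal KNN sketch: stores the training set and classifies a query point
# by majority vote among its K nearest neighbors.
from collections import Counter
import numpy as np

class KNNClassifier:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the examples; no data model is extracted.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict_one(self, x):
        # Euclidean distance d(x_i, x_j) from the query to every stored vector.
        dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        nearest = np.argsort(dists)[:self.k]   # indices of the K closest examples
        votes = Counter(self.y[nearest])       # majority rule over their labels
        return votes.most_common(1)[0][0]

    def predict(self, X):
        return np.array([self.predict_one(x) for x in X])
```

With k = 1 this reduces to the plain NN classifier; larger k trades training-set accuracy for better generalization, as discussed above.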
6.1.2 Decision Trees
Several supervised learning techniques have been proposed in the literature. One
example is decision trees. Breiman et al. [36] proposed classification and regression
trees (CART) for supervised learning problems. Here, the input-vector components
represent attributes of the examples. The inner nodes of decision trees use one at-
tribute at a time to recursively split the input space. Depending on the node's deci-
sion, one of its child nodes is visited until a leaf node is reached. The leaves store
the output of the tree. The order in which the individual attributes are queried is
important for the performance of decision trees. Several algorithms have been pro-
posed for the top-down inference of decision trees from a dataset. One example is
the ID3 algorithm, proposed by Quinlan [181] for classification. It first asks the
question that is most informative about the class. Decision trees can learn any consistent
dataset perfectly, but this may not be desirable. For this reason, ID3 has been ex-
tended to the C4.5 algorithm [182], which uses a Shannon entropy criterion for
pruning the trees. Another possibility to limit the tree size is to grow a tree only until ad-
ditional splitting of nodes produces no significant information gain. The preference
for small trees is motivated by the principle of Occam's razor [32], which prefers a
simpler explanation of a dataset over a more complicated one.
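The attribute-selection step can be illustrated with a small Python sketch that computes the information gain of each discrete attribute and queries the most informative one first, in the spirit of ID3. The helper names (entropy, information_gain, best_attribute) and the representation of examples as dictionaries mapping attribute names to discrete values are assumptions made for this sketch only.

```python
# ID3-style attribute selection for discrete attributes: the attribute with the
# largest information gain (reduction of Shannon entropy) is queried first.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Entropy reduction obtained by splitting on one discrete attribute."""
    n = len(labels)
    partitions = {}
    for example, label in zip(examples, labels):
        partitions.setdefault(example[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(examples, labels, attributes):
    """Attribute to query first: the one most informative about the class."""
    return max(attributes, key=lambda a: information_gain(examples, labels, a))
```

Growing the tree only while best_attribute yields a significant gain corresponds to the early-stopping alternative to pruning mentioned above.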
Decision trees work well if the input attributes are discrete and the dimension-
ality of the input vectors is not too large. However, the class boundaries produced
by decision trees are parallel to the axes of the input space. This may
cause difficulties if the true class boundaries are not axis-aligned.
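To make the recursive splitting concrete, the following sketch shows how an inner node queries a single attribute against a threshold and delegates to one of its children until a leaf is reached, so that the resulting class boundaries are parallel to the input axes. The Node and Leaf classes are illustrative only and are not taken from the text.

```python
# Minimal sketch of decision-tree recall: each inner node tests one attribute
# against a threshold (an axis-parallel split); leaves store the tree's output.
class Leaf:
    def __init__(self, output):
        self.output = output            # class label (or regression value) stored in the leaf

    def decide(self, x):
        return self.output

class Node:
    def __init__(self, attribute, threshold, left, right):
        self.attribute = attribute      # index of the attribute queried at this node
        self.threshold = threshold      # axis-parallel decision boundary
        self.left, self.right = left, right

    def decide(self, x):
        child = self.left if x[self.attribute] <= self.threshold else self.right
        return child.decide(x)

# Example: a depth-2 tree whose class boundaries are parallel to the input axes.
tree = Node(0, 0.5, Leaf("A"), Node(1, 2.0, Leaf("B"), Leaf("C")))
print(tree.decide([0.7, 3.1]))          # -> "C"
```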
6.1.3 Bayesian Classifier
The theoretically optimal classifier is based on Bayes' theorem of conditional prob-
ability. It is described by the Bayesian classification rule:
c(x) = argmax_j p(H_j) · p(x | H_j),
where p ( H j ) is the a-priori probability of class j , p ( x | H j ) describes the distri-
bution of the examples from class j in the input space, and c ( x ) is the classification
produced.
Equivalently, the rule selects the class with the largest posterior probability,

c(x) = argmax_j p(H_j | x) = argmax_j p(H_j) · p(x | H_j) / p(x),

where the evidence p(x) can be omitted because it does not depend on j.
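A minimal implementation of this classification rule could look as follows. Modeling the class-conditional densities p(x | H_j) as Gaussians with diagonal covariance is an assumption made purely for illustration (the text does not prescribe a density model), as is the estimation of the priors p(H_j) from class frequencies.

```python
# Sketch of the Bayesian classification rule c(x) = argmax_j p(H_j) * p(x | H_j).
# Class-conditional densities are modeled as diagonal Gaussians (illustrative choice).
import numpy as np

class GaussianBayesClassifier:
    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.priors[c] = len(Xc) / len(X)      # p(H_j), estimated from class frequencies
            self.means[c] = Xc.mean(axis=0)
            self.vars[c] = Xc.var(axis=0) + 1e-9   # small constant avoids division by zero
        return self

    def predict_one(self, x):
        x = np.asarray(x, dtype=float)
        def log_score(c):
            # log p(H_j) + log p(x | H_j) for a diagonal Gaussian density
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * self.vars[c])
                                    + (x - self.means[c]) ** 2 / self.vars[c])
            return np.log(self.priors[c]) + log_lik
        return max(self.classes, key=log_score)
```

Working with log-probabilities leaves the argmax unchanged while avoiding numerical underflow for high-dimensional inputs.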