The Bayesian classification rule is optimal but difficult to apply in real-world
situations. The reason is that, in practice, neither p(H_j) nor p(x | H_j) is known. While
the estimation of p(H_j) from the dataset is straightforward, it is difficult to model the
example distributions p(x | H_j) if the dimensionality of the input space is large.
This problem is known as 'the curse of dimensionality' [27]. If, for instance, a
grid-based representation is used to estimate a distribution, the number of grid cells
grows exponentially with the dimension of the space when the resolution for each
dimension is kept constant. Furthermore, in high-dimensional spaces the distances
between randomly chosen points tend to become nearly equal under any distance measure [29]. Finally,
sets of points that cannot be separated linearly in a low-dimensional space may
become linearly separable when transformed into a space of higher dimension.
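For reference, the rule itself can be stated compactly: an example x is assigned to the hypothesis with the largest posterior probability, which by Bayes' theorem involves exactly the two quantities discussed above (the symbol \hat{j} for the selected hypothesis index is introduced here only for illustration):

    \hat{j} = \arg\max_j \, p(H_j \mid x) = \arg\max_j \, \frac{p(x \mid H_j)\, p(H_j)}{\sum_k p(x \mid H_k)\, p(H_k)} = \arg\max_j \, p(x \mid H_j)\, p(H_j)

Since the denominator does not depend on j, the rule can be evaluated without the evidence term, but p(H_j) and p(x | H_j) are still required.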
6.1.4 Support Vector Machines
The last property of high-dimensional spaces is exploited by kernel methods. The
idea behind kernel methods is to transform data vectors into a feature space that usu-
ally has a huge or even infinite dimensionality. The kernel trick allows working
in that feature space without ever constructing it explicitly: dot products in the
feature space are computed by kernel functions directly from the original data vectors.
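A minimal sketch of this idea, not taken from the text: for the homogeneous polynomial kernel of degree two, the kernel value computed in the original space equals the dot product of explicitly constructed feature vectors. The names phi and poly2_kernel are hypothetical and chosen only for this illustration.

    import numpy as np

    def phi(x):
        # Explicit feature map: all pairwise products x_i * x_j (dimension d^2).
        return np.outer(x, x).ravel()

    def poly2_kernel(x, z):
        # The same quantity computed directly in the original d-dimensional space.
        return np.dot(x, z) ** 2

    rng = np.random.default_rng(0)
    x, z = rng.standard_normal(5), rng.standard_normal(5)

    # Both numbers agree: the kernel evaluates the feature-space dot product
    # without ever building the feature vectors.
    print(np.dot(phi(x), phi(z)))
    print(poly2_kernel(x, z))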
If multiple linear separations of the data vectors are possible, which of them should
be chosen? This question is addressed by the theory of structural risk minimization.
While empirical risk minimization focuses only on good performance on the training
set, structural risk minimization tries to find the learning machine that yields a good
trade-off between low empirical risk and small capacity.
The principle of structural risk minimization was proposed by Vapnik and Chervonenkis [233].
According to this principle, the number of training examples must be large compared
to the degrees of freedom of a learning machine. The capacity of a learning machine
is measured by its VC dimension. It is defined as the size of the largest set of points
that can be split by the learning machine into two subsets in all possible ways; such a
set is said to be shattered.
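One frequently cited form of the resulting guarantee, stated here only for illustration (the exact assumptions and constants are given in [233]), bounds the expected risk R of a classifier with VC dimension h trained on n examples, with probability at least 1 - \eta, in terms of its empirical risk R_emp:

    R \le R_{\mathrm{emp}} + \sqrt{\frac{h\,(\ln(2n/h) + 1) + \ln(4/\eta)}{n}}

Structural risk minimization selects, from a nested sequence of model classes of increasing VC dimension, the one that minimizes this bound rather than the empirical risk alone.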
Large-margin classifiers are one application of structural risk minimization. The
idea is that a linear separation that has a large margin to all training examples is
preferred over a separation with a small margin. This improves generalization
since it reduces the degrees of freedom of the classifier.
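This preference can be made precise for a linear separation w . x + b = 0; the symbols w, b and the training pairs (x_i, y_i) with y_i in {-1, +1} are introduced here only for illustration. The geometric margin of example i is y_i (w . x_i + b) / ||w||, and maximizing the smallest margin over all examples is equivalent to the quadratic program

    \min_{w,\, b} \; \tfrac{1}{2}\,\|w\|^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \quad \text{for all } i.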
In recent years, support-vector machines (SVMs) [47] have become popular
classifiers. They express the classification decision function in terms of a subset of
the training examples that lie close to the decision boundary, the support vectors.
Using the kernel trick and structural risk minimization, they separate the classes in
high-dimensional spaces with large margins. Powerful optimization techniques have
been developed for support-vector machines [207].
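As an illustrative sketch, not taken from the text, the following trains a kernel SVM with the scikit-learn library; the dataset and all parameter values are arbitrary choices for demonstration.

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # Two-class data that is not linearly separable in the input space.
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # The RBF kernel corresponds to an infinite-dimensional feature space;
    # C controls the trade-off between a large margin and training errors.
    clf = SVC(kernel='rbf', C=1.0, gamma='scale')
    clf.fit(X, y)

    # The decision function sign(sum_i alpha_i y_i k(x_i, x) + b) depends
    # only on the support vectors found during training.
    print(len(clf.support_vectors_), 'support vectors out of', len(X))
    print(clf.predict(X[:5]))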
6.1.5 Bias/Variance Dilemma
All supervised learning systems face the bias/variance dilemma [81]. The error of
such a system can be divided into three components:
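For squared-error loss these components are commonly written as follows; the notation \hat{y}(x) for the trained predictor, y for the target, and the expectations over training sets and noise are introduced here only for illustration:

    E\big[(y - \hat{y}(x))^2\big] = \big(E[\hat{y}(x)] - E[y \mid x]\big)^2 + E\big[(\hat{y}(x) - E[\hat{y}(x)])^2\big] + E\big[(y - E[y \mid x])^2\big]

The first term is the squared bias of the learner, the second its variance over different training sets, and the third the irreducible noise in the data; lowering the bias by using a more flexible learner typically increases the variance, and vice versa.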