The Bayesian classification rule is optimal but difficult to apply in real-world
situations. The reason is that, in practice, neither p(H_j) nor p(x | H_j) is known. While
the estimation of p(H_j) from the dataset is straightforward, it is difficult to model the
example distributions p(x | H_j) if the dimensionality of the input space is large.
This problem is known as 'the curse of dimensionality' [27]. If, for instance, a
grid-based representation is used to estimate a distribution, the number of grid cells
grows exponentially with the dimension of the space when the resolution for each
dimension is kept constant. Furthermore, in high-dimensional spaces the distances
between randomly chosen points tend to become nearly equal under any distance measure [29]. Finally,
sets of points that cannot be separated linearly in a low-dimensional space may
become linearly separable when transformed into a space of higher dimension.
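For reference, the rule itself can be stated compactly: an example x is assigned to the hypothesis with the largest posterior probability, which by Bayes' theorem involves exactly the two quantities discussed above (the symbol \hat{j} for the selected hypothesis index is introduced here only for illustration):

    \hat{j} = \arg\max_j \, p(H_j \mid x) = \arg\max_j \, \frac{p(x \mid H_j)\, p(H_j)}{\sum_k p(x \mid H_k)\, p(H_k)} = \arg\max_j \, p(x \mid H_j)\, p(H_j)

Since the denominator does not depend on j, the rule can be evaluated without the evidence term, but p(H_j) and p(x | H_j) are still required.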
6.1.4 Support Vector Machines
The last property of high-dimensional spaces is exploited by kernel methods. The
idea behind kernel methods is to transform data vectors into a feature space that usu-
ally has a huge or even infinite dimensionality. The kernel trick allows working
in that feature space without ever constructing it explicitly: dot products in the
feature space are computed by kernel functions directly from the original data vectors.
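A minimal sketch of this idea, not taken from the text: for the homogeneous polynomial kernel of degree two, the kernel value computed in the original space equals the dot product of explicitly constructed feature vectors. The names phi and poly2_kernel are hypothetical and chosen only for this illustration.

    import numpy as np

    def phi(x):
        # Explicit feature map: all pairwise products x_i * x_j (dimension d^2).
        return np.outer(x, x).ravel()

    def poly2_kernel(x, z):
        # The same quantity computed directly in the original d-dimensional space.
        return np.dot(x, z) ** 2

    rng = np.random.default_rng(0)
    x, z = rng.standard_normal(5), rng.standard_normal(5)

    # Both numbers agree: the kernel evaluates the feature-space dot product
    # without ever building the feature vectors.
    print(np.dot(phi(x), phi(z)))
    print(poly2_kernel(x, z))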
If multiple linear separations of the data vectors are possible, which of them should
be chosen? This question is addressed by the theory of structural risk minimization.
While empirical risk minimization focuses only on good performance on the training
set, structural risk minimization tries to find the learning machine that yields a good
trade-off between low empirical risk and small capacity.
The principle of structural risk minimization was proposed by Vapnik and Chervonenkis [233].
According to this principle, the number of training examples must be large compared
to the degrees of freedom of a learning machine. The capacity of a learning machine
is measured by its VC dimension. It is defined as the size of the largest set of points
that can be split by the learning machine into two subsets in all possible ways; such a
set is said to be shattered.
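One frequently cited form of the resulting guarantee, stated here only for illustration (the exact assumptions and constants are given in [233]), bounds the expected risk R of a classifier with VC dimension h trained on n examples, with probability at least 1 - \eta, in terms of its empirical risk R_emp:

    R \le R_{\mathrm{emp}} + \sqrt{\frac{h\,(\ln(2n/h) + 1) + \ln(4/\eta)}{n}}

Structural risk minimization selects, from a nested sequence of model classes of increasing VC dimension, the one that minimizes this bound rather than the empirical risk alone.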
Large-margin classifiers are one application of structural risk minimization. The
idea is that a linear separation that has a large margin to all training examples is
preferred over a separation with a small margin. This improves generalization
since it reduces the degrees of freedom of the classifier.
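This preference can be made precise for a linear separation w . x + b = 0; the symbols w, b and the training pairs (x_i, y_i) with y_i in {-1, +1} are introduced here only for illustration. The geometric margin of example i is y_i (w . x_i + b) / ||w||, and maximizing the smallest margin over all examples is equivalent to the quadratic program

    \min_{w,\, b} \; \tfrac{1}{2}\,\|w\|^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \quad \text{for all } i.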
In recent years, support-vector machines (SVMs) [47] have become popular
classifiers. They express the classification decision function in terms of a subset of
the training examples that lie close to the decision boundary, the support vectors.
Using the kernel trick and structural risk minimization, they separate the classes in
high-dimensional spaces with large margins. Powerful optimization techniques have
been developed for support-vector machines [207].
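As an illustrative sketch, not taken from the text, the following trains a kernel SVM with the scikit-learn library; the dataset and all parameter values are arbitrary choices for demonstration.

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # Two-class data that is not linearly separable in the input space.
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # The RBF kernel corresponds to an infinite-dimensional feature space;
    # C controls the trade-off between a large margin and training errors.
    clf = SVC(kernel='rbf', C=1.0, gamma='scale')
    clf.fit(X, y)

    # The decision function sign(sum_i alpha_i y_i k(x_i, x) + b) depends
    # only on the support vectors found during training.
    print(len(clf.support_vectors_), 'support vectors out of', len(X))
    print(clf.predict(X[:5]))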
6.1.5 Bias/Variance Dilemma
All supervised learning systems face the bias/variance dilemma [81]. The error of
such a system can be divided into three components:
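For squared-error loss these components are commonly written as follows; the notation \hat{y}(x) for the trained predictor, y for the target, and the expectations over training sets and noise are introduced here only for illustration:

    E\big[(y - \hat{y}(x))^2\big] = \big(E[\hat{y}(x)] - E[y \mid x]\big)^2 + E\big[(\hat{y}(x) - E[\hat{y}(x)])^2\big] + E\big[(y - E[y \mid x])^2\big]

The first term is the squared bias of the learner, the second its variance over different training sets, and the third the irreducible noise in the data; lowering the bias by using a more flexible learner typically increases the variance, and vice versa.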