the considered heuristics. However, it is important to emphasize that having
a unique solution is mathematically satisfactory, but does not guarantee that
the classifier has good generalization properties!
The basic idea underlying SVMs is quite old [Cover 1965]: instead of using a multilayered neural network, Cover proposed to map the input vector $x \in \mathbb{R}^N$ to a vector $\Phi(x) \in \mathbb{R}^{N'}$ of higher dimension $N' > N$, in a space called feature space, where the discriminant surface is linear. For example, the quadratic mapping
$$\varphi : x \mapsto \Phi(x) = \bigl(x_1, x_2, \ldots, x_N,\; x_1^2, x_1 x_2, \ldots, x_1 x_N,\; x_2^2, x_2 x_3, \ldots, x_{N-1} x_N,\; x_N^2\bigr)$$
yields a vector $\Phi$ with $N' = N + N(N+1)/2$ components: the $N$ components of $x$ plus the $N(N+1)/2$ products of pairs of components of $x$. Training sets that are separable by a second-order surface in input space $x \in \mathbb{R}^N$ become linearly separable in the space of quadratic features $\mathbb{R}^{N'}$, and a simple perceptron in feature space can solve the problem.
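As an added illustration (not part of the original text), here is a minimal sketch of such a quadratic feature map; the function name quadratic_features and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def quadratic_features(x):
    """Map x in R^N to the quadratic feature space of dimension N + N(N+1)/2:
    the N original components plus all products x_i * x_j with i <= j."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    products = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate([x, np.array(products)])

# Example: for N = 3 the feature vector has 3 + 3*4/2 = 9 components.
phi = quadratic_features([1.0, 2.0, 3.0])
print(phi.shape)  # (9,)
```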
When the problem is linearly separable, we have already seen that there are infinitely many separating hyperplanes. The SVM solution is the solution with maximal stability in feature space. If the following condition on the aligned fields in feature space is obeyed for every example $k$,
$$y_k\, \Phi(x_k) \cdot w \ge 1,$$
the margin, which is the distance of the hyperplane to the closest example, is $1/\lVert w \rVert$.
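The step from the aligned-field condition to the value of the margin can be made explicit; the short derivation below is an added illustration, assuming (as in the formulation above) a separating hyperplane through the origin of feature space, $w \cdot \Phi = 0$.

```latex
% Distance of example k to the separating hyperplane w . Phi = 0 in feature space:
\[
  d_k \;=\; \frac{\lvert w \cdot \Phi(x_k) \rvert}{\lVert w \rVert}
       \;=\; \frac{y_k\, w \cdot \Phi(x_k)}{\lVert w \rVert}
       \;\ge\; \frac{1}{\lVert w \rVert},
\]
% with equality for the examples closest to the hyperplane, so the margin
% (the smallest of the d_k) equals 1/||w|| when the constraints are tight.
```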
Therefore, maximizing the margin is equivalent to minimizing the norm $\lVert w \rVert$ of the weight vector, i.e., minimizing the cost
$$E = \tfrac{1}{2}\, w \cdot w$$
under the above constraints of aligned fields at least equal to 1 in feature space. If the latter can be satisfied, the solution is a hard-margin SVM. Minimizing $E$ under those constraints is a quadratic optimization problem in dimension $N'$, in a convex domain. Its solution is unique, and several algorithms have been optimized to solve it (some of them are available at http://kernel-machines.org). The important point is that the solution can be expressed as follows:
$$w = \sum_{k=1}^{M} c_k\, y_k\, \Phi(x_k),$$
with $c_k \ge 0$. It can be shown that there are two subsets of examples: those with $c_k = 0$ and those with $c_k > 0$. The former do not contribute to the weights: if they were excluded from the training set, the solution would be the same. The examples with $c_k > 0$, which actually determine the solution, are called support vectors. They verify $y_k\, w \cdot \Phi(x_k) = 1$, meaning that they are exactly at the margin; in other words, they lie at a distance $1/\lVert w \rVert$ from the separating hyperplane.
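As a further added illustration (not from the original text), the sketch below approximates the hard-margin solution with scikit-learn's SVC by using a very large penalty C on a toy set that is separable in the quadratic feature space; the dataset, the helper quadratic_features, and the choice of scikit-learn are assumptions of the example. Note that SVC also fits a bias term, which the formulation above does not include.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D problem separable by a circle, hence by a hyperplane in quadratic features.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = np.where(np.sum(X**2, axis=1) < 1.5, 1, -1)

# Explicit quadratic feature map (same map as in the earlier sketch).
def quadratic_features(x):
    n = len(x)
    return np.concatenate([x, [x[i] * x[j] for i in range(n) for j in range(i, n)]])

Phi = np.array([quadratic_features(x) for x in X])

# A very large C approximates the hard-margin SVM in feature space.
clf = SVC(kernel="linear", C=1e6).fit(Phi, y)

w = clf.coef_.ravel()
margin = 1.0 / np.linalg.norm(w)   # margin = 1 / ||w||
sv = clf.support_                  # indices of the support vectors

# Support vectors satisfy y_k (w . Phi(x_k) + b) ~ 1: they sit exactly on the margin.
aligned_fields = y[sv] * clf.decision_function(Phi[sv])
print(margin, aligned_fields.min(), aligned_fields.max())
```

For the support vectors the printed aligned fields are close to 1, so they lie at distance $1/\lVert w \rVert$ from the separating hyperplane, as stated above.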