the considered heuristics. However, it is important to emphasize that having
a unique solution is mathematically satisfactory, but does not guarantee that
the classifier has good generalization properties!
The basic idea underlying SVMs is quite old [Cover 1965]: instead of using a multilayered neural network, Cover proposed to map the input vector $x \in \mathbb{R}^N$ to a vector $\Phi(x) \in \mathbb{R}^{N'}$ of higher dimension $N' > N$, in a space called feature space, where the discriminant surface is linear. For example, the quadratic mapping
$$\varphi : x \mapsto \Phi(x) = \bigl(x_1, x_2, \ldots, x_N,\; x_1^2, x_1 x_2, \ldots, x_1 x_N,\; x_2^2, x_2 x_3, \ldots, x_{N-1} x_N,\; x_N^2\bigr)$$
yields a vector $\Phi$ with $N' = N + N(N+1)/2$ components: the $N$ components of $x$ plus the $N(N+1)/2$ products of pairs of components of $x$. Training sets that are separable by a second-order surface in input space $x \in \mathbb{R}^N$ become linearly separable in the space of quadratic features $\mathbb{R}^{N'}$, and a simple perceptron in feature space can solve the problem.
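As an added illustration (not part of the original text), here is a minimal sketch of such a quadratic feature map; the function name quadratic_features and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def quadratic_features(x):
    """Map x in R^N to the quadratic feature space of dimension N + N(N+1)/2:
    the N original components plus all products x_i * x_j with i <= j."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    products = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate([x, np.array(products)])

# Example: for N = 3 the feature vector has 3 + 3*4/2 = 9 components.
phi = quadratic_features([1.0, 2.0, 3.0])
print(phi.shape)  # (9,)
```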
When the problem is linearly separable, we have already seen that there are infinitely many separating hyperplanes. The SVM solution is the solution with maximal stability in feature space. If the following condition on the aligned fields in feature space is obeyed for every example $k$,
$$y_k\, \Phi(x_k) \cdot w \ge 1,$$
the margin, which is the distance of the hyperplane to the closest example, is $1/\lVert w \rVert$.
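The step from the aligned-field condition to the value of the margin can be made explicit; the short derivation below is an added illustration, assuming (as in the formulation above) a separating hyperplane through the origin of feature space, $w \cdot \Phi = 0$.

```latex
% Distance of example k to the separating hyperplane w . Phi = 0 in feature space:
\[
  d_k \;=\; \frac{\lvert w \cdot \Phi(x_k) \rvert}{\lVert w \rVert}
       \;=\; \frac{y_k\, w \cdot \Phi(x_k)}{\lVert w \rVert}
       \;\ge\; \frac{1}{\lVert w \rVert},
\]
% with equality for the examples closest to the hyperplane, so the margin
% (the smallest of the d_k) equals 1/||w|| when the constraints are tight.
```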
Therefore, maximizing the margin is equivalent to minimizing the norm $\lVert w \rVert$ of the weight vector, i.e., minimizing the cost
$$E = \tfrac{1}{2}\, w \cdot w$$
under the above constraints of aligned fields at least equal to 1 in feature space. If the latter can be satisfied, the solution is a hard-margin SVM. Minimizing $E$ under those constraints is a quadratic optimization problem in dimension $N'$, in a convex domain. Its solution is unique, and several algorithms have been optimized to solve it (some of them are available at http://kernel-machines.org). The important point is that the solution can be expressed as follows:
$$w = \sum_{k=1}^{M} c_k\, y_k\, \Phi(x_k),$$
with $c_k \ge 0$. It can be shown that there are two subsets of examples: those with $c_k = 0$ and those with $c_k > 0$. The former do not contribute to the weights: if they were excluded from the training set, the solution would be the same. The examples with $c_k > 0$, which actually determine the solution, are called support vectors. They verify $y_k\, w \cdot \Phi(x_k) = 1$, meaning that they are exactly at the margin; in other words, they lie at a distance $1/\lVert w \rVert$ from the separating hyperplane.
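As a further added illustration (not from the original text), the sketch below approximates the hard-margin solution with scikit-learn's SVC by using a very large penalty C on a toy set that is separable in the quadratic feature space; the dataset, the helper quadratic_features, and the choice of scikit-learn are assumptions of the example. Note that SVC also fits a bias term, which the formulation above does not include.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D problem separable by a circle, hence by a hyperplane in quadratic features.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = np.where(np.sum(X**2, axis=1) < 1.5, 1, -1)

# Explicit quadratic feature map (same map as in the earlier sketch).
def quadratic_features(x):
    n = len(x)
    return np.concatenate([x, [x[i] * x[j] for i in range(n) for j in range(i, n)]])

Phi = np.array([quadratic_features(x) for x in X])

# A very large C approximates the hard-margin SVM in feature space.
clf = SVC(kernel="linear", C=1e6).fit(Phi, y)

w = clf.coef_.ravel()
margin = 1.0 / np.linalg.norm(w)   # margin = 1 / ||w||
sv = clf.support_                  # indices of the support vectors

# Support vectors satisfy y_k (w . Phi(x_k) + b) ~ 1: they sit exactly on the margin.
aligned_fields = y[sv] * clf.decision_function(Phi[sv])
print(margin, aligned_fields.min(), aligned_fields.max())
```

For the support vectors the printed aligned fields are close to 1, so they lie at distance $1/\lVert w \rVert$ from the separating hyperplane, as stated above.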