
We wish to find a hyper-plane which will separate the two classes such that

all points on one side of the hyper-plane will be labelled +1, all points on the

other side will be labelled -1. Define a discriminant function

$$g(\mathbf{x}) = \mathbf{w}^{*} \cdot \mathbf{x} + b^{*},$$

where $\mathbf{w}^{*}, b^{*}$ are the parameters of the optimal hyper-plane. The function $g(\mathbf{x})$ gives the distance from an arbitrary $\mathbf{x}$ to the optimal hyper-plane.
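As a minimal sketch of evaluating the discriminant, the following Python/NumPy snippet (the language and the parameter values are assumptions for illustration, not from the text) classifies a point by the sign of $g(\mathbf{x})$ and recovers the geometric distance by dividing by $\|\mathbf{w}^{*}\|$:

```python
import numpy as np

# Hypothetical optimal parameters w*, b* chosen for illustration only;
# in practice they come from the margin-maximization problem below.
w_star = np.array([2.0, -1.0])
b_star = 0.5

def g(x):
    """Discriminant: the sign gives the class label; the magnitude is
    proportional to the distance from x to the hyper-plane (exactly the
    distance when ||w*|| = 1, otherwise divide by ||w*||)."""
    return np.dot(w_star, x) + b_star

x = np.array([1.0, 1.0])
label = 1 if g(x) >= 0 else -1
distance = abs(g(x)) / np.linalg.norm(w_star)
```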

Parameters of the optimal hyper-plane are obtained by maximizing the margin, which is equivalent to minimizing the cost function

$$\Phi(\mathbf{w}) = \frac{1}{2}\,\|\mathbf{w}\|^{2} = \frac{1}{2}\,\mathbf{w} \cdot \mathbf{w},$$

subject to the constraints

$$y_{i}(\mathbf{w} \cdot \mathbf{x}_{i} + b) - 1 \geq 0, \qquad i = 1, \ldots, N.$$
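The margin-maximization problem above can be sketched directly as a constrained minimization. The snippet below (Python with SciPy, a choice assumed here; the four-point toy data set is likewise an assumption) minimizes $\Phi(\mathbf{w}) = \frac{1}{2}\,\mathbf{w}\cdot\mathbf{w}$ subject to $y_{i}(\mathbf{w}\cdot\mathbf{x}_{i}+b) \geq 1$ using a general-purpose solver rather than the Lagrangian dual derived next:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (assumed for illustration).
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as v = [w1, w2, b].
def cost(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)          # Phi(w) = 1/2 w.w

# One inequality constraint y_i (w.x_i + b) - 1 >= 0 per training point.
constraints = [
    {"type": "ineq",
     "fun": lambda v, i=i: y[i] * (np.dot(v[:2], X[i]) + v[2]) - 1.0}
    for i in range(len(X))
]

res = minimize(cost, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
```

For this symmetric toy set the minimizer is $\mathbf{w} = (0.25, 0.25)$, $b = 0$; every training point then satisfies its margin constraint.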

This is an optimization problem with inequality constraints and can be solved

by means of Lagrange multipliers. We form the Lagrangian

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\,\mathbf{w} \cdot \mathbf{w} - \sum_{i=1}^{N} \alpha_{i}\left[\,y_{i}(\mathbf{w} \cdot \mathbf{x}_{i} + b) - 1\,\right],$$

where $\alpha_{i} \geq 0$ are the Lagrange multipliers. We need to minimize $L(\mathbf{w}, b, \boldsymbol{\alpha})$ with respect to $\mathbf{w}$ and $b$ while requiring that the derivatives of $L(\mathbf{w}, b, \boldsymbol{\alpha})$ with respect to all the $\alpha_{i}$ vanish, subject to the constraint $\alpha_{i} \geq 0$. After solving the optimization problem, the discriminant function becomes

$$g(\mathbf{x}) = \sum_{i=1}^{N} y_{i}\alpha_{i}\,\mathbf{x}_{i} \cdot \mathbf{x} + b^{*},$$

where $\alpha_{i}, b^{*}$ are the parameters of the optimal decision hyper-plane. This shows that the distance can be computed as a weighted sum over the training data and the Lagrange multipliers, and that the training vectors $\mathbf{x}_{i}$ enter only through inner products.
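The dual form of $g(\mathbf{x})$ can be checked against the primal form in a few lines. In the sketch below (Python/NumPy assumed; the two support vectors and the multiplier values are hypothetical, not derived here), the weighted sum of inner products agrees with $\mathbf{w} \cdot \mathbf{x} + b^{*}$ where $\mathbf{w} = \sum_{i} \alpha_{i} y_{i} \mathbf{x}_{i}$:

```python
import numpy as np

# Hypothetical dual solution for illustration: two support vectors
# with equal multipliers (values assumed, not computed here).
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.0625, 0.0625])
b_star = 0.0

def g(x):
    # Distance as a weighted sum of inner products with training vectors.
    return np.sum(y * alpha * (X @ x)) + b_star

# Equivalent primal form: w = sum_i alpha_i y_i x_i.
w = (alpha * y) @ X
x_new = np.array([1.0, 3.0])
# Both forms give the same value of the discriminant at x_new.
```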

One can extend the linear case to non-linearly separable data by introducing a kernel function

$$K(\mathbf{x}_{i}, \mathbf{x}_{j}) = \varphi(\mathbf{x}_{i}) \cdot \varphi(\mathbf{x}_{j}),$$

where $\varphi(\mathbf{x})$ is some non-linear mapping into a (possibly infinite-dimensional) space $H$,

$$\varphi : \mathbb{R}^{n} \longrightarrow H.$$

Since Support Vector Machines use only inner products to compute the discriminant function, given a kernel $K(\mathbf{x}_{i}, \mathbf{x}_{j})$ we can train an SVM without ever having to know $\varphi(\mathbf{x})$ [3]. The implication is that the number of parameters the SVM has to learn does not depend on the choice of the kernel and, therefore, of the mapping $\varphi$. This gives an obvious computational advantage when mapping the original feature space into a higher-dimensional space, which is the main obstacle in the previous approach based on quadratic regression.
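The kernel trick can be illustrated concretely with the degree-2 polynomial kernel, a standard identity (the example below, in Python/NumPy, is an illustration chosen here, not taken from the text): $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^{2}$ equals the inner product of the explicit feature maps $\varphi(\mathbf{x}) = (x_{1}^{2}, \sqrt{2}\,x_{1}x_{2}, x_{2}^{2})$, so the kernel computes the inner product in $H$ without ever forming $\varphi$:

```python
import numpy as np

# Degree-2 polynomial kernel K(x, z) = (x.z)^2 and its explicit
# feature map phi: R^2 -> R^3 (a standard identity, for illustration).
def K(x, z):
    return np.dot(x, z) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
# One kernel evaluation replaces the explicit mapping into H.
kernel_value = K(x, z)
explicit_value = np.dot(phi(x), phi(z))
```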

Examples of typical kernel functions are:
