We wish to find a hyperplane that separates the two classes such that all points on one side of the hyperplane are labelled +1 and all points on the other side are labelled -1. Define a discriminant function
$$ g(\mathbf{x}) = \mathbf{w}^{*} \cdot \mathbf{x} + b^{*}, $$
where $\mathbf{w}^{*}$ and $b^{*}$ are the parameters of the optimal hyperplane. The function $g(\mathbf{x})$ gives the distance from an arbitrary $\mathbf{x}$ to the optimal hyperplane.
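For illustration, a minimal Python sketch of evaluating such a linear discriminant; the weight vector w_star and bias b_star below are placeholder values, not parameters taken from the text.

```python
import numpy as np

# Hypothetical, already-trained hyperplane parameters (w*, b*).
w_star = np.array([0.4, -1.2, 0.7])
b_star = -0.3

def g(x):
    """Discriminant g(x) = w* . x + b*; its sign tells on which side of the hyperplane x lies."""
    return np.dot(w_star, x) + b_star

x = np.array([1.0, 0.5, -2.0])
label = 1 if g(x) >= 0 else -1   # +1 on one side of the hyperplane, -1 on the other
print(g(x), label)
```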
Parameters of the optimal hyperplane are obtained by maximizing the margin, which is equivalent to minimizing the cost function
$$ \Phi(\mathbf{w}) = \mathbf{w} \cdot \mathbf{w} = \|\mathbf{w}\|^{2}, $$
subject to the constraints
$$ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \geq 0, \qquad i = 1, \ldots, N. $$
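As a rough sketch of this constrained minimization (not the solver used in the text), the primal problem can be handed to a general-purpose optimizer such as SciPy's SLSQP; the tiny separable data set below is made up purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data: two points per class.
X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, d = X.shape

def cost(p):
    w = p[:d]
    return w @ w                        # Phi(w) = w . w

def margins(p):
    w, b = p[:d], p[d]
    return y * (X @ w + b) - 1.0        # y_i (w . x_i + b) - 1 >= 0 for every i

res = minimize(cost, x0=np.full(d + 1, 0.5), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margins}])
w_opt, b_opt = res.x[:d], res.x[d]
print(w_opt, b_opt, margins(res.x))     # all margins should come out >= 0
```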
This is an optimization problem with inequality constraints and can be solved by means of Lagrange multipliers. We form the Lagrangian
$$ L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} - \sum_{i=1}^{N} \alpha_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right], $$
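Written out as code, the Lagrangian is just the cost term minus the multiplier-weighted constraint terms; the arrays in the example call are hypothetical placeholders.

```python
import numpy as np

def lagrangian(w, b, alpha, X, y):
    """L(w, b, alpha) = 1/2 w.w - sum_i alpha_i * [ y_i (w . x_i + b) - 1 ]."""
    return 0.5 * (w @ w) - alpha @ (y * (X @ w + b) - 1.0)

# Hypothetical values, purely to show the call signature.
X = np.array([[2.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
print(lagrangian(np.array([0.5, 0.5]), 0.0, np.array([0.1, 0.1]), X, y))
```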
where $\alpha_i \geq 0$ are the Lagrange multipliers. We need to minimize $L(\mathbf{w}, b, \boldsymbol{\alpha})$ with respect to $\mathbf{w}$ and $b$, while requiring that the derivatives of $L(\mathbf{w}, b, \boldsymbol{\alpha})$ with respect to all the $\alpha_i$ vanish, subject to the constraint that $\alpha_i \geq 0$. After solving the optimization problem, the discriminant function becomes
$$ g(\mathbf{x}) = \sum_{i=1}^{N} y_i \alpha_i\, \mathbf{x}_i \cdot \mathbf{x} + b^{*}, $$
where $\alpha_i$ and $b^{*}$ are the parameters of the optimal decision hyperplane. This shows that the distance can be computed as a weighted sum over the training data and the Lagrange multipliers, and that the training vectors $\mathbf{x}_i$ enter only through inner products.
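A short sketch of evaluating this dual-form discriminant; the multipliers alpha and the bias b_star would normally come from the solved optimization problem, so the numbers here are placeholders.

```python
import numpy as np

def g_dual(x, X, y, alpha, b_star):
    """g(x) = sum_i y_i alpha_i (x_i . x) + b*; training vectors appear only via inner products x_i . x."""
    return float(np.sum(y * alpha * (X @ x)) + b_star)

# Placeholder training data and multipliers, for illustration only.
X = np.array([[2.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.3])
print(g_dual(np.array([1.0, 1.0]), X, y, alpha, b_star=0.1))
```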
One can extend the linear case to non-linearly separable data by introducing a kernel function
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j), $$
where $\phi(\mathbf{x})$ is some non-linear mapping into a (possibly infinite-dimensional) space $H$, $\phi : \mathbb{R}^{n} \longrightarrow H$.
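One way to see this definition in action (not an example from the text) is the degree-2 homogeneous polynomial kernel, whose feature map $\phi$ can be written out explicitly in two dimensions; the sketch below checks that the kernel value and the inner product in feature space agree.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def K(xi, xj):
    """K(xi, xj) = (xi . xj)^2, which equals phi(xi) . phi(xj)."""
    return np.dot(xi, xj) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(xi, xj), np.dot(phi(xi), phi(xj)))   # both print the same value (here 1.0)
```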
Since Support Vector Machines use only inner products to compute the discriminant function, given a kernel $K(\mathbf{x}_i, \mathbf{x}_j)$ we can train an SVM without ever having to know $\phi(\mathbf{x})$ [3]. The implication is that the number of parameters that has to be learned by the SVM does not depend on the choice of the kernel and, therefore, on the mapping $\phi$. This gives an obvious computational advantage when mapping the original feature space into a higher-dimensional space, which is the main obstacle in the previous approach based on quadratic regression.
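In the kernelized form, the discriminant of the previous paragraph simply has every inner product $\mathbf{x}_i \cdot \mathbf{x}$ replaced by $K(\mathbf{x}_i, \mathbf{x})$; a sketch under the same placeholder assumptions as the earlier examples:

```python
import numpy as np

def g_kernel(x, X, y, alpha, b_star, K):
    """g(x) = sum_i y_i alpha_i K(x_i, x) + b*, with the kernel K supplied as a function."""
    return float(sum(y[i] * alpha[i] * K(X[i], x) for i in range(len(y))) + b_star)

# Reusing the toy quantities from the earlier sketches, with a quadratic kernel.
X = np.array([[2.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.3])
quadratic = lambda u, v: np.dot(u, v) ** 2
print(g_kernel(np.array([1.0, 1.0]), X, y, alpha, 0.1, quadratic))
```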
Examples of typical kernel functions are:
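The original list is not reproduced here; as a hedged illustration, two kernels that are commonly used in practice are the polynomial and Gaussian (RBF) kernels, sketched below with arbitrary default parameters.

```python
import numpy as np

def polynomial_kernel(xi, xj, degree=3, c=1.0):
    """Polynomial kernel: K(xi, xj) = (xi . xj + c)^degree."""
    return (np.dot(xi, xj) + c) ** degree

def rbf_kernel(xi, xj, gamma=0.5):
    """Gaussian (RBF) kernel: K(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))
```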