We wish to find a hyper-plane which will separate the two classes such that all points on one side of the hyper-plane will be labelled +1 and all points on the other side will be labelled -1. Define a discriminant function

g(x) = w · x + b,

where w and b are the parameters of the optimal hyper-plane. The function g(x) gives the signed distance from an arbitrary x to the optimal hyper-plane, up to a scaling factor of ||w||.
Parameters of the optimal hyper-plane are obtained by maximizing the margin, which is equivalent to minimizing the cost function

Φ(w) = ||w||² = w · w,

subject to the constraint that

y_i (w · x_i + b) − 1 ≥ 0,   i = 1, ..., N.
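As a quick illustration of the discriminant function and the margin constraints, the following minimal sketch (assuming NumPy; the toy data and the candidate w and b are invented for illustration and are not taken from the text) evaluates g(x) = w · x + b and checks whether y_i (w · x_i + b) − 1 ≥ 0 holds for every training point:

```python
import numpy as np

def g(x, w, b):
    """Linear discriminant g(x) = w . x + b; its sign gives the class label."""
    return np.dot(w, x) + b

# Toy linearly separable data with labels in {+1, -1} (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# A candidate hyper-plane (not necessarily the margin-maximizing one).
w = np.array([0.5, 0.5])
b = 0.0

predicted = np.sign([g(x, w, b) for x in X])   # labels assigned by the hyper-plane
constraints = y * (X @ w + b) - 1              # y_i (w . x_i + b) - 1
print(predicted, np.all(constraints >= 0))     # True if all margin constraints hold
```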
This is an optimization problem with inequality constraints and can be solved by means of Lagrange multipliers. We form the Lagrangian

L(w, b, α) = (1/2) w · w − Σ_{i=1}^{N} α_i [ y_i (w · x_i + b) − 1 ],
where the α_i ≥ 0 are the Lagrange multipliers. L(w, b, α) has to be minimized with respect to w and b and maximized with respect to the α_i, subject to the constraint that α_i ≥ 0. After solving the optimization problem, the discriminant function becomes

g(x) = Σ_{i=1}^{N} y_i α_i (x_i · x) + b,
where α_i and b are the parameters of the optimal decision hyper-plane. This shows that the distance can be computed as a sum over the training data weighted by the Lagrange multipliers, and that the training vectors x_i enter only through inner products.
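To make the preceding steps concrete, here is a minimal sketch that solves the problem numerically with scipy.optimize (a tool chosen for illustration, not mentioned in the text). Setting the derivatives of L with respect to w and b to zero yields the standard dual: maximize Σ α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0, a step not spelled out in this excerpt. The toy data below are invented:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: solve the dual for a toy data set and recover the discriminant
# g(x) = sum_i y_i alpha_i (x_i . x) + b. The dual objective and its
# constraints follow from dL/dw = 0 and dL/db = 0 (standard derivation,
# not shown in the excerpt above).

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)           # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):                                 # minimize the negative dual
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(N),
    bounds=[(0.0, None)] * N,                        # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
    method="SLSQP",
)
alpha = res.x

# b is fixed by any support vector (alpha_i > 0): y_i (w . x_i + b) = 1.
w = (alpha * y) @ X
sv = np.argmax(alpha)
b = y[sv] - w @ X[sv]

def g(x):
    """Discriminant as a weighted sum of inner products with the training data."""
    return np.sum(alpha * y * (X @ x)) + b

print(np.sign([g(x) for x in X]))                    # should reproduce the labels
```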
One can extend the linear case to non-linearly separable data by introducing a kernel function

K(x_i, x_j) = ϕ(x_i) · ϕ(x_j),

where ϕ(x) is some non-linear mapping into a (possibly infinite-dimensional) space H,

ϕ : R^n → H.
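For a concrete and deliberately simple instance of this definition, the sketch below uses the homogeneous second-degree polynomial kernel K(u, v) = (u · v)², an assumed example not named in the excerpt, and verifies numerically that it equals the inner product of an explicit feature map ϕ:

```python
import numpy as np

# Sketch: check K(x_i, x_j) = phi(x_i) . phi(x_j) for one concrete kernel.
# The kernel K(u, v) = (u . v)^2 is used purely for illustration.

def phi(x):
    """Explicit feature map for K(u, v) = (u . v)^2 on R^2: maps into R^3."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def K(u, v):
    return np.dot(u, v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

print(K(u, v), np.dot(phi(u), phi(v)))   # both equal (u . v)^2 = 1.0
```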
Since Support Vector Machines use only inner products to compute the discriminant function, given a kernel K(x_i, x_j) we can train an SVM without ever having to know ϕ(x) [3]. The implication of this is that the number of parameters that has to be learned by the SVM does not depend on the choice of the kernel and, therefore, on the mapping ϕ. This gives an obvious computational advantage when mapping the original feature space into a higher-dimensional space, which is the main obstacle in the previous approach based on quadratic regression.
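In code, the change from the linear to the kernelized case amounts to replacing every inner product x_i · x with K(x_i, x). The sketch below uses a Gaussian (RBF) kernel as an assumed choice for illustration; both the discriminant and the Gram matrix needed for training are expressed purely in terms of kernel evaluations, so ϕ(x) never appears:

```python
import numpy as np

# Sketch: kernelized discriminant and Gram matrix. The RBF kernel is an
# assumed example; any valid kernel function can be plugged in instead.

def rbf_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def g_kernel(x, X, y, alpha, b, kernel=rbf_kernel):
    """Kernelized discriminant: g(x) = sum_i y_i alpha_i K(x_i, x) + b."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

def gram_matrix(X, kernel=rbf_kernel):
    """K_ij = K(x_i, x_j): all that training needs; phi(x) is never computed."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])
```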
Examples of typical kernel functions are: