We wish to find a hyperplane that separates the two classes such that all points on one side of the hyperplane are labelled +1 and all points on the other side are labelled -1. Define a discriminant function
$$ g(\mathbf{x}) = \mathbf{w}^{*} \cdot \mathbf{x} + b^{*}, $$
where $\mathbf{w}^{*}$ and $b^{*}$ are the parameters of the optimal hyperplane. The function $g(\mathbf{x})$ gives the distance from an arbitrary $\mathbf{x}$ to the optimal hyperplane.
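For illustration, a minimal Python sketch of evaluating such a linear discriminant; the weight vector w_star and bias b_star below are placeholder values, not parameters taken from the text.

```python
import numpy as np

# Hypothetical, already-trained hyperplane parameters (w*, b*).
w_star = np.array([0.4, -1.2, 0.7])
b_star = -0.3

def g(x):
    """Discriminant g(x) = w* . x + b*; its sign tells on which side of the hyperplane x lies."""
    return np.dot(w_star, x) + b_star

x = np.array([1.0, 0.5, -2.0])
label = 1 if g(x) >= 0 else -1   # +1 on one side of the hyperplane, -1 on the other
print(g(x), label)
```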
Parameters of the optimal hyperplane are obtained by maximizing the margin, which is equivalent to minimizing the cost function
$$ \Phi(\mathbf{w}) = \mathbf{w} \cdot \mathbf{w} = \|\mathbf{w}\|^{2}, $$
subject to the constraints
$$ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \geq 0, \qquad i = 1, \ldots, N. $$
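As a rough sketch of this constrained minimization (not the solver used in the text), the primal problem can be handed to a general-purpose optimizer such as SciPy's SLSQP; the tiny separable data set below is made up purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data: two points per class.
X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, d = X.shape

def cost(p):
    w = p[:d]
    return w @ w                        # Phi(w) = w . w

def margins(p):
    w, b = p[:d], p[d]
    return y * (X @ w + b) - 1.0        # y_i (w . x_i + b) - 1 >= 0 for every i

res = minimize(cost, x0=np.full(d + 1, 0.5), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margins}])
w_opt, b_opt = res.x[:d], res.x[d]
print(w_opt, b_opt, margins(res.x))     # all margins should come out >= 0
```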
This is an optimization problem with inequality constraints and can be solved by means of Lagrange multipliers. We form the Lagrangian
$$ L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} - \sum_{i=1}^{N} \alpha_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right], $$
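Written out as code, the Lagrangian is just the cost term minus the multiplier-weighted constraint terms; the arrays in the example call are hypothetical placeholders.

```python
import numpy as np

def lagrangian(w, b, alpha, X, y):
    """L(w, b, alpha) = 1/2 w.w - sum_i alpha_i * [ y_i (w . x_i + b) - 1 ]."""
    return 0.5 * (w @ w) - alpha @ (y * (X @ w + b) - 1.0)

# Hypothetical values, purely to show the call signature.
X = np.array([[2.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
print(lagrangian(np.array([0.5, 0.5]), 0.0, np.array([0.1, 0.1]), X, y))
```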
where $\alpha_i \geq 0$ are the Lagrange multipliers. We need to minimize $L(\mathbf{w}, b, \boldsymbol{\alpha})$ with respect to $\mathbf{w}$ and $b$, while requiring that the derivatives of $L(\mathbf{w}, b, \boldsymbol{\alpha})$ with respect to all the $\alpha_i$ vanish, subject to the constraint that $\alpha_i \geq 0$. After solving the optimization problem, the discriminant function becomes
$$ g(\mathbf{x}) = \sum_{i=1}^{N} y_i \alpha_i\, \mathbf{x}_i \cdot \mathbf{x} + b^{*}, $$
where $\alpha_i$ and $b^{*}$ are the parameters of the optimal decision hyperplane. This shows that the distance can be computed as a weighted sum over the training data and the Lagrange multipliers, and that the training vectors $\mathbf{x}_i$ enter only through inner products.
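A short sketch of evaluating this dual-form discriminant; the multipliers alpha and the bias b_star would normally come from the solved optimization problem, so the numbers here are placeholders.

```python
import numpy as np

def g_dual(x, X, y, alpha, b_star):
    """g(x) = sum_i y_i alpha_i (x_i . x) + b*; training vectors appear only via inner products x_i . x."""
    return float(np.sum(y * alpha * (X @ x)) + b_star)

# Placeholder training data and multipliers, for illustration only.
X = np.array([[2.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.3])
print(g_dual(np.array([1.0, 1.0]), X, y, alpha, b_star=0.1))
```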
One can extend the linear case to non-linearly separable data by introducing a kernel function
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j), $$
where $\phi(\mathbf{x})$ is some non-linear mapping into a (possibly infinite-dimensional) space $H$, $\phi : \mathbb{R}^{n} \longrightarrow H$.
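One way to see this definition in action (not an example from the text) is the degree-2 homogeneous polynomial kernel, whose feature map $\phi$ can be written out explicitly in two dimensions; the sketch below checks that the kernel value and the inner product in feature space agree.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def K(xi, xj):
    """K(xi, xj) = (xi . xj)^2, which equals phi(xi) . phi(xj)."""
    return np.dot(xi, xj) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(xi, xj), np.dot(phi(xi), phi(xj)))   # both print the same value (here 1.0)
```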
Since Support Vector Machines use only inner products to compute the discriminant function, given a kernel $K(\mathbf{x}_i, \mathbf{x}_j)$ we can train an SVM without ever having to know $\phi(\mathbf{x})$ [3]. The implication is that the number of parameters that has to be learned by the SVM does not depend on the choice of the kernel and, therefore, on the mapping $\phi$. This gives an obvious computational advantage when mapping the original feature space into a higher-dimensional space, which is the main obstacle in the previous approach based on quadratic regression.
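In the kernelized form, the discriminant of the previous paragraph simply has every inner product $\mathbf{x}_i \cdot \mathbf{x}$ replaced by $K(\mathbf{x}_i, \mathbf{x})$; a sketch under the same placeholder assumptions as the earlier examples:

```python
import numpy as np

def g_kernel(x, X, y, alpha, b_star, K):
    """g(x) = sum_i y_i alpha_i K(x_i, x) + b*, with the kernel K supplied as a function."""
    return float(sum(y[i] * alpha[i] * K(X[i], x) for i in range(len(y))) + b_star)

# Reusing the toy quantities from the earlier sketches, with a quadratic kernel.
X = np.array([[2.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.3])
quadratic = lambda u, v: np.dot(u, v) ** 2
print(g_kernel(np.array([1.0, 1.0]), X, y, alpha, 0.1, quadratic))
```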
Examples of typical kernel functions are:
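The original list is not reproduced here; as a hedged illustration, two kernels that are commonly used in practice are the polynomial and Gaussian (RBF) kernels, sketched below with arbitrary default parameters.

```python
import numpy as np

def polynomial_kernel(xi, xj, degree=3, c=1.0):
    """Polynomial kernel: K(xi, xj) = (xi . xj + c)^degree."""
    return (np.dot(xi, xj) + c) ** degree

def rbf_kernel(xi, xj, gamma=0.5):
    """Gaussian (RBF) kernel: K(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))
```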