Theorem 3 Euclidean distance does not fulfill
the conditions (D1)-(D5).
Proof: We give a counterexample. We assume that a weighting function w is given which fulfills the conditions (W1)-(W4). Further assume that learning tasks t_i, t_j are given which contain no alternative features. We add an alternative feature X_il for X_ik ∈ X_B, i.e., X_il ~ X_ik, and get X'_B = X_B ∪ {X_il}. We infer from conditions (W2) and (W3) that
$$w'(X_{ik}) = w'(X_{il}) = \frac{w(X_{ik})}{2} \quad\text{and}\quad w'(X_{jk}) = w'(X_{jl}) = \frac{w(X_{jk})}{2}$$
and from condition (W4) that
$$\forall p \neq k:\ w'(X_{ip}) = w(X_{ip}) \quad\text{and}\quad \forall p \neq k:\ w'(X_{jp}) = w(X_{jp}).$$

In this case the following holds for the Euclidean distance:
$$d'(t_i, t_j)^2 = S + 2\left(\frac{w(X_{ik})}{2} - \frac{w(X_{jk})}{2}\right)^2 = S + \frac{1}{2}\left(w(X_{ik}) - w(X_{jk})\right)^2 \neq S + \left(w(X_{ik}) - w(X_{jk})\right)^2 = d(t_i, t_j)^2$$
with

$$S = \sum_{p=1,\ p \neq k}^{|X_B|} \left(w(X_{ip}) - w(X_{jp})\right)^2 = \sum_{p=1,\ p \neq k}^{|X_B|} \left(w'(X_{ip}) - w'(X_{jp})\right)^2.$$
Hence, alternative features here do make a difference, violating (D5).
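The effect can be checked numerically. The following sketch (ours, not from the chapter) computes the Euclidean distance between two tasks from made-up feature weights, then splits the weight of one feature over an added alternative as required by (W2) and (W3); the concrete weight values are illustrative assumptions.

```python
# Numeric check of the counterexample: adding an alternative feature changes the
# Euclidean distance between the weight vectors, although the tasks did not change.
import numpy as np

def euclid(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# made-up weights w(X_ip) and w(X_jp); the last entry plays the role of X_ik / X_jk
w_i = np.array([0.5, 0.3, 0.2])
w_j = np.array([0.4, 0.2, 0.4])
d_before = euclid(w_i, w_j)                # sqrt(S + (0.2 - 0.4)^2) ~= 0.245

# add the alternative feature X_il: the weight of X_ik is split evenly over both
# copies ((W2), (W3)); all other weights stay unchanged (W4)
w_i_alt = np.array([0.5, 0.3, 0.1, 0.1])
w_j_alt = np.array([0.4, 0.2, 0.2, 0.2])
d_after = euclid(w_i_alt, w_j_alt)         # sqrt(S + 0.5 * (0.2 - 0.4)^2) = 0.2

print(d_before, d_after)                   # the two distances differ, violating (D5)
```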
positive results

We have shown that many common feature weighting algorithms and distance measures cannot be used to define a distance between learning tasks in our scenario. In this section, we will prove that the feature weights delivered by a linear Support Vector Machine (SVM) obey the proposed weighting conditions. Afterwards, we also discuss a distance measure fulfilling the distance conditions.

Support Vector Machines are based on the work of Vapnik in statistical learning theory (Vapnik, 1995). They aim to minimize the regularized risk R_reg(f) of a learned function f, which is the weighted sum of the empirical risk R_emp(f) and a complexity term ||w||²:

$$R_{reg}(f) = R_{emp}(f) + \lambda \|w\|^2$$

The result is a linear decision function y = sgn(w · x + b) with a minimal length of w. The vector w is the normal vector of an optimal hyperplane with a maximal margin to both classes. One of the strengths of SVMs is the use of kernel functions, which extend the feature space and allow linear decision boundaries after efficient nonlinear transformations of the input (Schölkopf and Smola, 2002). Since our goal is the construction of (nonlinear) features during preprocessing, we can simply use the simplest kernel function, the dot product. In this case the components of the vector w can be interpreted as weights for all features.
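To make this interpretation concrete, the following sketch (ours, not from the chapter) trains a linear SVM with scikit-learn and reads the feature weights off the learned weight vector; the synthetic data, the added noise feature, and all names are illustrative assumptions. With enough examples, the weight of the irrelevant noise feature comes out much smaller than the weights of the informative features, in line with condition (W1).

```python
# Reading feature weights off a linear SVM (illustrative sketch using scikit-learn).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
informative = rng.normal(size=(200, 2))            # two informative features
labels = (informative[:, 0] + informative[:, 1] > 0).astype(int)
noise = rng.normal(size=(200, 1))                  # irrelevant feature, unrelated to the labels
X = np.hstack([informative, noise])

svm = SVC(kernel="linear", C=1.0).fit(X, labels)   # dot-product kernel, so w lives in the input space
w = svm.coef_.ravel()                              # components of w, interpretable as feature weights
print(w)                                           # the third weight (noise) is close to 0
```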
 
Theorem 4 The feature weight calculation of SVMs with linear kernel function meets the conditions (W1)-(W4).

Proof: Since these conditions can be proved for a single learning task t_i, we write X_k and w_k as a shortcut for X_ik and w(X_ik).

(W1) Sketch: We assume that the SVM finds an optimal hyperplane. The algorithm tries to minimize both the length of w and the empirical error. This naturally corresponds to a maximum margin hyperplane where the weights of irrelevant features are 0 if enough data points are given.

(W2) SVMs find the optimal hyperplane by minimizing the weight vector w. The classification with the optimal hyperplane and weight vector w can be written as y = sgn(w_1 x_1 + … + w_i x_i + … + w_m x_m + b). We will show that this vector cannot be changed by adding the same feature more than once.
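The argument for (W2) can also be checked empirically. The sketch below (ours, not part of the proof) trains a linear SVM twice with scikit-learn, once on the original data and once with one feature duplicated; the data and all names are illustrative assumptions. The weight of the duplicated feature is split roughly in half over the two identical copies while the remaining weights stay essentially unchanged, as conditions (W2) and (W3) demand.

```python
# Duplicating a feature splits its weight between the two copies (illustrative sketch).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - 0.5 * X[:, 2] > 0).astype(int)

w_orig = SVC(kernel="linear", C=1.0).fit(X, y).coef_.ravel()

X_dup = np.hstack([X, X[:, [0]]])                  # append an exact copy of feature 0
w_dup = SVC(kernel="linear", C=1.0).fit(X_dup, y).coef_.ravel()

print(w_orig)                                      # [w_0, w_1, w_2]
print(w_dup)                                       # approximately [w_0/2, w_1, w_2, w_0/2]
```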