Finally, we point out a computational difficulty of S3VMs. The S3VM objective function (6.16) is non-convex. A function $g$ is convex if, for all $z_1, z_2$ and all $0 \le \lambda \le 1$,

$g(\lambda z_1 + (1 - \lambda) z_2) \le \lambda g(z_1) + (1 - \lambda) g(z_2).$   (6.21)
For example, the SVM objective (6.12) is a convex function of the parameters $\mathbf{w}, b$. This can be
verified by the convexity of the hinge loss, the squared norm, and the fact that the sum of convex
functions is convex. Minimizing a convex function is relatively easy, as such a function has a well-
defined “bottom.” On the other hand, the hat loss function is non-convex, as demonstrated by $z_1 = -0.5$, $z_2 = 1$, and $\lambda = 0.5$. With the sum of a large number of hat functions, the S3VM objective (6.16) is non-convex with multiple local minima. A learning algorithm can get trapped in a sub-optimal local minimum, and not find the global minimum solution. The research in S3VMs has focused on how to efficiently find a near-optimal solution; some of this work is listed in the bibliographical notes.
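As an illustrative aside (not part of the original text), the convexity test (6.21) can be checked numerically at the point above. The sketch below writes $z$ for the argument of each loss and assumes the hinge loss $\max(1 - z, 0)$ and the hat loss $\max(1 - |z|, 0)$, the forms presumed to be used in (6.12) and (6.16); the hinge loss satisfies the inequality at $z_1 = -0.5$, $z_2 = 1$, $\lambda = 0.5$, while the hat loss violates it.

```python
# Numerical check of the convexity inequality (6.21) at a single point.
# Assumed loss forms: hinge(z) = max(1 - z, 0), hat(z) = max(1 - |z|, 0).

def convexity_holds(g, z1, z2, lam):
    """True if g(lam*z1 + (1-lam)*z2) <= lam*g(z1) + (1-lam)*g(z2)."""
    return g(lam * z1 + (1 - lam) * z2) <= lam * g(z1) + (1 - lam) * g(z2)

hinge = lambda z: max(1 - z, 0.0)
hat   = lambda z: max(1 - abs(z), 0.0)

z1, z2, lam = -0.5, 1.0, 0.5
print(convexity_holds(hinge, z1, z2, lam))  # True: 0.75 <= 0.5*1.5 + 0.5*0
print(convexity_holds(hat,   z1, z2, lam))  # False: hat(0.25) = 0.75 > 0.5*0.5 + 0.5*0 = 0.25
```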
6.3 ENTROPY REGULARIZATION∗
SVMs and S3VMs are non-probabilistic models. That is, they are not designed to compute the
label posterior probability $p(y \mid \mathbf{x})$ when making a classification. In statistical machine learning, there are many probabilistic models which compute $p(y \mid \mathbf{x})$ from labeled training data for classification.
Interestingly, there is a direct analogue of S3VM for these probabilistic models too, known as entropy
regularization. To make our discussion concrete, we will first introduce a particular probabilistic
model: logistic regression, and then extend it to semi-supervised learning via entropy regularization.
Logistic regression models the posterior probability $p(y \mid \mathbf{x})$. Like SVMs, it uses a linear decision function $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$. Let the label $y \in \{-1, 1\}$. Recall that if $f(\mathbf{x}) \gg 0$, $\mathbf{x}$ is deep within the positive side of the decision boundary; if $f(\mathbf{x}) \ll 0$, $\mathbf{x}$ is deep within the negative side; and $f(\mathbf{x}) = 0$ means $\mathbf{x}$ is right on the decision boundary with maximum label uncertainty. Logistic regression models the posterior probability by

$p(y \mid \mathbf{x}) = 1 / \left(1 + \exp(-y f(\mathbf{x}))\right),$   (6.22)
which “squashes” $f(\mathbf{x}) \in (-\infty, \infty)$ down to $p(y \mid \mathbf{x}) \in [0, 1]$. The model parameters are $\mathbf{w}$ and $b$, like in SVMs. Given a labeled training sample $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, the conditional log likelihood is defined as

$\sum_{i=1}^{l} \log p(y_i \mid \mathbf{x}_i, \mathbf{w}, b).$   (6.23)
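As a hedged illustration (not from the text), the sketch below evaluates the posterior (6.22) and the conditional log likelihood (6.23) for a toy labeled sample; the weight vector, bias, and data points are made-up placeholders.

```python
import numpy as np

# Posterior (6.22): p(y | x) = 1 / (1 + exp(-y * f(x))), with f(x) = w.x + b.
def posterior(y, x, w, b):
    return 1.0 / (1.0 + np.exp(-y * (np.dot(w, x) + b)))

# Conditional log likelihood (6.23): sum_i log p(y_i | x_i, w, b).
def conditional_log_likelihood(X, y, w, b):
    return sum(np.log(posterior(yi, xi, w, b)) for xi, yi in zip(X, y))

# Made-up parameters and two labeled points, for illustration only.
w, b = np.array([1.0, -2.0]), 0.5
X = [np.array([0.3, 0.1]), np.array([-1.0, 0.4])]
y = [+1, -1]

# The posteriors for y = +1 and y = -1 always sum to 1.
print(posterior(+1, X[0], w, b) + posterior(-1, X[0], w, b))  # 1.0
print(conditional_log_likelihood(X, y, w, b))
```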
If we further introduce a Gaussian distribution as the prior on $\mathbf{w}$:

$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, I/(2\lambda))$   (6.24)
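As a brief illustrative aside (not from the text), the log density of the prior (6.24) works out to $-\lambda \|\mathbf{w}\|^2$ plus a constant, i.e., an L2 penalty on $\mathbf{w}$, which is presumably why the covariance is parameterized as $I/(2\lambda)$. The sketch below checks this numerically with made-up values of $\lambda$ and $\mathbf{w}$.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Prior (6.24): w ~ N(0, I / (2*lambda)).  Its log density is
# -lambda * ||w||^2 + (d/2) * log(lambda / pi): an L2 penalty plus a constant.
lam, d = 0.7, 3                        # made-up values for illustration
w = np.array([0.2, -1.0, 0.5])         # made-up weight vector

log_prior = multivariate_normal.logpdf(w, mean=np.zeros(d), cov=np.eye(d) / (2 * lam))
closed_form = -lam * np.dot(w, w) + 0.5 * d * np.log(lam / np.pi)
print(np.isclose(log_prior, closed_form))  # True
```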