[Figure 3.4 appears here: two panels, each a mixture of two uniform distributions on [0, 1]. Left: p(x | y = -1) = 5 with weight 0.2 and p(x | y = 1) = 1.25 with weight 0.8. Right: p(x | y = -1) = 1.67 with weight 0.6 and p(x | y = 1) = 2.5 with weight 0.4. Both mixtures yield the same marginal p(x) = 1 on [0, 1].]
Figure 3.4: An example of unidentifiable models. Even if we know p(x) is a mixture of two uniform distributions, we cannot uniquely identify the two components. For instance, the two mixtures produce the same p(x), but they classify x = 0.5 differently. Note the height of each distribution represents a probability density (which can be greater than 1), not probability mass. The area under each distribution is 1.
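To make the caption's example concrete, the following sketch (my reconstruction of the figure's numbers, not code from the text) checks that both mixtures induce the same marginal density on [0, 1] while disagreeing on the label of x = 0.5:

```python
def mixture_a(x):
    """0.2 * Uniform(0, 0.2) + 0.8 * Uniform(0.2, 1); component densities 5 and 1.25."""
    neg = 5.0 if 0.0 <= x <= 0.2 else 0.0    # p(x | y = -1)
    pos = 1.25 if 0.2 < x <= 1.0 else 0.0    # p(x | y = +1)
    return 0.2 * neg + 0.8 * pos

def mixture_b(x):
    """0.6 * Uniform(0, 0.6) + 0.4 * Uniform(0.6, 1); component densities 1/0.6 and 2.5."""
    neg = (1.0 / 0.6) if 0.0 <= x <= 0.6 else 0.0    # p(x | y = -1)
    pos = 2.5 if 0.6 < x <= 1.0 else 0.0             # p(x | y = +1)
    return 0.6 * neg + 0.4 * pos

# Both mixtures give the same marginal p(x) = 1 everywhere on [0, 1] ...
for x in [0.05, 0.3, 0.5, 0.9]:
    assert abs(mixture_a(x) - 1.0) < 1e-9
    assert abs(mixture_b(x) - 1.0) < 1e-9

# ... yet at x = 0.5, mixture A puts all posterior mass on y = +1
# (0.5 lies outside its y = -1 component), while mixture B puts all
# posterior mass on y = -1. The components are unidentifiable from p(x) alone.
```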
it. Selecting a better θ^(0) that is more likely to lead to the global optimum (or simply a better local optimum) is another heuristic method, though this may require domain expertise.
Finally, we note that the goal of optimization for semi-supervised learning with mixture models
is to maximize the log likelihood (3.13). The EM algorithm is only one of several optimization
methods to find a (local) optimum. Direct optimization methods are possible, too, for example
quasi-Newton methods like L-BFGS [115].
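As a sketch of such direct optimization (not code from the text), the semi-supervised log likelihood can be handed to an off-the-shelf quasi-Newton optimizer. The data, the two-component Gaussian parameterization, and the logit trick for the mixing weight below are all my illustrative assumptions; the labeled terms use log p(x, y) and the unlabeled terms use log p(x), mirroring the objective in (3.13):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy semi-supervised data (assumed, for illustration): 1-D, two classes.
rng = np.random.default_rng(0)
x_lab = np.array([-2.0, -1.5, 1.5, 2.0])
y_lab = np.array([0, 0, 1, 1])
x_unl = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])

def neg_log_lik(theta):
    """Negative semi-supervised log likelihood: labeled terms contribute
    log p(x, y | theta), unlabeled terms contribute log p(x | theta).
    theta = (logit of class-1 weight, mean of class 0, mean of class 1);
    both components have fixed unit variance for simplicity."""
    a, mu0, mu1 = theta
    w = 1.0 / (1.0 + np.exp(-a))              # mixing weight of class 1
    pi = np.array([1.0 - w, w])
    mus = np.array([mu0, mu1])
    lab = np.sum(np.log(pi[y_lab]) + norm.logpdf(x_lab, mus[y_lab], 1.0))
    dens = pi[0] * norm.pdf(x_unl, mu0, 1.0) + pi[1] * norm.pdf(x_unl, mu1, 1.0)
    unl = np.sum(np.log(dens))
    return -(lab + unl)

# L-BFGS finds a (local) optimum of the same objective EM targets.
res = minimize(neg_log_lik, x0=np.array([0.0, -1.0, 1.0]), method="L-BFGS-B")
```

With data generated around -2 and +2, the fitted means in `res.x[1:]` land near those centers; as with EM, only a local optimum is guaranteed, so the starting point `x0` still matters.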
3.6 CLUSTER-THEN-LABEL METHODS
We have used the EM algorithm to identify the mixing components from unlabeled data. Recall
that unsupervised clustering algorithms can also identify clusters from unlabeled data. This suggests
a natural
cluster-then-label
algorithm for semi-supervised classification.
Algorithm 3.9. Cluster-then-Label.
Input: labeled data (x_1, y_1), ..., (x_l, y_l), unlabeled data x_{l+1}, ..., x_{l+u},
a clustering algorithm A, and a supervised learning algorithm L.
1. Cluster x_1, ..., x_{l+u} using A.
2. For each resulting cluster, let S be the labeled instances in this cluster:
3. If S is non-empty, learn a supervised predictor from S: f_S = L(S).
Apply f_S to all unlabeled instances in this cluster.
4. If S is empty, use the predictor f trained from all labeled data.
Output: labels on unlabeled data y_{l+1}, ..., y_{l+u}.
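A minimal sketch of Algorithm 3.9, assuming scikit-learn is available; k-means stands in for the clustering algorithm A and logistic regression for the supervised learner L (both are my choices for illustration, not prescribed by the text):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def cluster_then_label(x_lab, y_lab, x_unl, n_clusters=2):
    """Cluster all l+u instances, then label each cluster's unlabeled
    points with a predictor trained on that cluster's labeled points."""
    x_all = np.vstack([x_lab, x_unl])
    # Step 1: cluster x_1, ..., x_{l+u} using A (here: k-means).
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=0).fit_predict(x_all)
    c_lab, c_unl = clusters[:len(x_lab)], clusters[len(x_lab):]
    # Step 4 fallback: predictor f trained from all labeled data.
    f_all = LogisticRegression().fit(x_lab, y_lab)
    y_unl = np.empty(len(x_unl), dtype=y_lab.dtype)
    for c in range(n_clusters):
        mask = c_unl == c
        if not mask.any():
            continue
        S = c_lab == c  # labeled instances falling in cluster c
        if S.any() and len(np.unique(y_lab[S])) > 1:
            # Step 3: f_S = L(S), applied to this cluster's unlabeled points.
            f = LogisticRegression().fit(x_lab[S], y_lab[S])
        elif S.any():
            # Single-class S: the cluster predictor degenerates to that class.
            y_unl[mask] = y_lab[S][0]
            continue
        else:
            f = f_all  # step 4: S is empty
        y_unl[mask] = f.predict(x_unl[mask])
    return y_unl
```

For example, with two labeled points at (-2, 0) and (2, 0) and unlabeled points near each, the two k-means clusters each inherit the label of their single labeled member.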