[Figure 3.4 graphic: the uniform marginal p(x) = 1 on [0, 1] decomposed in two ways:
(first mixture) p(x) = 0.2 × p(x | y = -1) + 0.8 × p(x | y = 1), with p(x | y = -1) = 5 on [0, 0.2] and p(x | y = 1) = 1.25 on [0.2, 1];
(second mixture) p(x) = 0.6 × p(x | y = -1) + 0.4 × p(x | y = 1), with p(x | y = -1) = 1.67 on [0, 0.6] and p(x | y = 1) = 2.5 on [0.6, 1].]
Figure 3.4: An example of unidentifiable models. Even if we know p(x) is a mixture of two uniform distributions, we cannot uniquely identify the two components. For instance, the two mixtures above produce the same p(x), but they classify x = 0.5 differently. Note the height of each distribution represents a probability density (which can be greater than 1), not probability mass. The area under each distribution is 1.
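To see the ambiguity numerically, here is a small sketch (illustrative code, not from the text; the function and variable names are ours) that evaluates both decompositions: each yields the same marginal p(x), yet Bayes' rule gives opposite posteriors at x = 0.5.

```python
# Two mixture decompositions of the same uniform marginal p(x) = 1 on [0, 1].
# Each class-conditional is uniform on an interval.

def uniform_pdf(x, lo, hi):
    """Density of a uniform distribution on [lo, hi]."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

# (mixing weight of y = -1, support of y = -1, support of y = +1)
mixture_A = {"w_neg": 0.2, "neg": (0.0, 0.2), "pos": (0.2, 1.0)}
mixture_B = {"w_neg": 0.6, "neg": (0.0, 0.6), "pos": (0.6, 1.0)}

def marginal(m, x):
    """p(x) = w_neg * p(x | y=-1) + (1 - w_neg) * p(x | y=+1)."""
    return (m["w_neg"] * uniform_pdf(x, *m["neg"])
            + (1.0 - m["w_neg"]) * uniform_pdf(x, *m["pos"]))

def posterior_pos(m, x):
    """p(y=+1 | x) by Bayes' rule."""
    return (1.0 - m["w_neg"]) * uniform_pdf(x, *m["pos"]) / marginal(m, x)

for name, m in [("A", mixture_A), ("B", mixture_B)]:
    print(name, "p(x=0.5) =", marginal(m, 0.5),
          " p(y=+1 | x=0.5) =", posterior_pos(m, 0.5))
# Both mixtures give p(x=0.5) = 1.0, but the posterior is 1.0 under A and
# 0.0 under B: two equally valid decompositions label x = 0.5 differently.
```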
it. Selecting a better θ^(0) that is more likely to lead to the global optimum (or simply a better local optimum) is another heuristic method, though this may require domain expertise.
Finally, we note that the goal of optimization for semi-supervised learning with mixture models
is to maximize the log likelihood (3.13). The EM algorithm is only one of several optimization
methods to find a (local) optimum. Direct optimization methods are possible, too, for example
quasi-Newton methods like L-BFGS [115].
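As a sketch of such direct optimization (this example is not from the text: the one-dimensional two-Gaussian model, its parameterization, and all names here are our own assumptions), one can hand the semi-supervised log likelihood of (3.13), i.e., the labeled joint terms plus the unlabeled marginal terms, to an off-the-shelf quasi-Newton routine such as SciPy's L-BFGS-B:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy semi-supervised data for a 1-D mixture of two Gaussians (classes -1, +1).
rng = np.random.default_rng(0)
x_l = np.array([-2.1, -1.8, 1.9, 2.2]); y_l = np.array([-1, -1, 1, 1])
x_u = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])

def neg_log_likelihood(theta):
    """Negative semi-supervised log likelihood (cf. (3.13)):
    sum over labeled log p(x, y) + sum over unlabeled log p(x)."""
    a, mu_neg, mu_pos, log_s_neg, log_s_pos = theta
    w = 1.0 / (1.0 + np.exp(-a))           # mixing weight of class +1, in (0, 1)
    s_neg, s_pos = np.exp(log_s_neg), np.exp(log_s_pos)  # positive std devs
    # Labeled terms: log [ p(y) p(x | y) ] for the observed class of each point.
    ll = np.sum(np.where(y_l == 1,
                         np.log(w) + norm.logpdf(x_l, mu_pos, s_pos),
                         np.log(1 - w) + norm.logpdf(x_l, mu_neg, s_neg)))
    # Unlabeled terms: log of the marginal mixture density p(x).
    p_u = (w * norm.pdf(x_u, mu_pos, s_pos)
           + (1 - w) * norm.pdf(x_u, mu_neg, s_neg))
    ll += np.sum(np.log(p_u))
    return -ll

# Direct quasi-Newton optimization instead of EM; like EM, it finds a
# local optimum of the same objective.
theta0 = np.array([0.0, -1.0, 1.0, 0.0, 0.0])
result = minimize(neg_log_likelihood, theta0, method="L-BFGS-B")
print(result.x)
```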
3.6 CLUSTER-THEN-LABEL METHODS
We have used the EM algorithm to identify the mixing components from unlabeled data. Recall
that unsupervised clustering algorithms can also identify clusters from unlabeled data. This suggests
a natural cluster-then-label algorithm for semi-supervised classification.
Algorithm 3.9. Cluster-then-Label.
Input: labeled data (x_1, y_1), ..., (x_l, y_l), unlabeled data x_{l+1}, ..., x_{l+u},
a clustering algorithm A, and a supervised learning algorithm L.
1. Cluster x_1, ..., x_{l+u} using A.
2. For each resulting cluster, let S be the labeled instances in this cluster:
3. If S is non-empty, learn a supervised predictor from S: f_S = L(S).
   Apply f_S to all unlabeled instances in this cluster.
4. If S is empty, use the predictor f trained from all labeled data.
Output: labels on unlabeled data y_{l+1}, ..., y_{l+u}.
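A minimal sketch of Algorithm 3.9, assuming scikit-learn's KMeans as the clustering algorithm A and logistic regression as the supervised learner L (both choices, and all names below, are illustrative; a single-class cluster is handled by predicting that class directly, since most classifiers require two classes to fit):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def cluster_then_label(X_l, y_l, X_u, n_clusters=2):
    """Algorithm 3.9 sketch: cluster all instances, then label each cluster's
    unlabeled instances with a predictor trained on its labeled instances."""
    X_all = np.vstack([X_l, X_u])
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_all)
    c_l, c_u = clusters[:len(X_l)], clusters[len(X_l):]

    # Fallback predictor f trained from all labeled data (used when S is empty).
    f_global = LogisticRegression().fit(X_l, y_l)

    y_u = np.empty(len(X_u), dtype=y_l.dtype)
    for c in np.unique(c_u):
        in_l = c_l == c                   # labeled instances S in this cluster
        in_u = c_u == c
        if in_l.sum() > 0 and len(np.unique(y_l[in_l])) > 1:
            f_S = LogisticRegression().fit(X_l[in_l], y_l[in_l])  # f_S = L(S)
        elif in_l.sum() > 0:
            y_u[in_u] = y_l[in_l][0]      # S has one class: assign it directly
            continue
        else:
            f_S = f_global                # S is empty: fall back to f
        y_u[in_u] = f_S.predict(X_u[in_u])
    return y_u

# Example: two blobs, one labeled point per blob; the cluster structure
# propagates each label to all unlabeled points in its cluster.
rng = np.random.default_rng(1)
X_u = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
X_l = np.array([[-2.0, -2.0], [2.0, 2.0]]); y_l = np.array([-1, 1])
print(cluster_then_label(X_l, y_l, X_u))
```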