Figure 3.5: Cluster-then-label results using single linkage hierarchical agglomerative clustering (A) and majority vote (L). The three panels plot the instances by weight (lbs.): the partially labeled data, the single linkage clustering, and the predicted labeling.
One can use any clustering algorithm A and supervised learner L. In step 1, the clustering algorithm A is unsupervised. In step 2, we learn one supervised predictor using the labeled instances that fall into each cluster, and use the predictor to label the unlabeled instances in that cluster. It is worth noting that cluster-then-label does not necessarily involve a probabilistic mixture model. The following example shows cluster-then-label with Hierarchical Agglomerative Clustering for A, and simple majority vote within each cluster for L.
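As a sketch of the general procedure, the Python outline below implements cluster-then-label for an arbitrary choice of A and L. The callables cluster and fit_predictor, and the fallback for clusters that contain no labeled data, are assumptions of this sketch rather than part of the algorithm as stated here.

    import numpy as np

    def cluster_then_label(X_l, y_l, X_u, cluster, fit_predictor):
        """Label X_u via cluster-then-label.

        cluster(X) -> one cluster id per row   (plays the role of A)
        fit_predictor(X, y) -> callable        (plays the role of L)
        """
        X_all = np.vstack([X_l, X_u])
        ids = cluster(X_all)              # step 1: unsupervised clustering
        n_l = len(X_l)
        y_u = np.empty(len(X_u), dtype=y_l.dtype)
        for c in np.unique(ids):
            members = ids == c
            lab, unlab = members[:n_l], members[n_l:]
            if not unlab.any():
                continue
            if lab.any():
                # step 2: train L on the labeled instances in this cluster
                f = fit_predictor(X_l[lab], y_l[lab])
            else:
                # a cluster with no labeled data: one reasonable fallback
                # (an assumption of this sketch) is to train L on all labels
                f = fit_predictor(X_l, y_l)
            # use the predictor to label the cluster's unlabeled instances
            y_u[unlab] = f(X_u[unlab])
        return y_u

With hierarchical agglomerative clustering as cluster and a majority-vote learner as fit_predictor, this reduces to the setting of the example below.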
Example 3.10. Cluster-then-Label with Hierarchical Agglomerative Clustering. In this example, we apply the cluster-then-label approach to the aliens data from Chapter 1. For step 1, we use hierarchical agglomerative clustering (Algorithm 1.1) with Euclidean distance as the underlying distance function, and the single linkage method to determine distances between clusters. Because the labeled data contains only two classes, we heuristically stop the algorithm once only two clusters remain. Steps 2-4 of cluster-then-label find the majority label within each cluster, and assign this label to all unlabeled instances in the cluster.
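A minimal sketch of this example follows, with SciPy's linkage and fcluster standing in for Algorithm 1.1. The weight/height arrays are hypothetical stand-ins, since the aliens data itself is not reproduced here.

    import numpy as np
    from collections import Counter
    from scipy.cluster.hierarchy import linkage, fcluster

    # Hypothetical stand-in for the partially labeled aliens data:
    # columns are weight (lbs.) and height; -1 marks an unlabeled instance.
    X = np.array([[82., 44.], [88., 46.], [94., 48.], [102., 50.],
                  [82., 62.], [88., 64.], [94., 66.], [102., 68.]])
    y = np.array([0, -1, -1, -1, 1, -1, -1, -1])

    # Step 1: single linkage agglomerative clustering with Euclidean
    # distance, stopped heuristically once two clusters remain.
    Z = linkage(X, method="single", metric="euclidean")
    clusters = fcluster(Z, t=2, criterion="maxclust")

    # Steps 2-4: find the majority label among the labeled instances in
    # each cluster, and assign it to the cluster's unlabeled instances.
    for c in np.unique(clusters):
        in_c = clusters == c
        labeled = y[in_c & (y != -1)]
        if len(labeled) > 0:
            majority = Counter(labeled).most_common(1)[0][0]
            y[in_c & (y == -1)] = majority

    print(y)  # every instance now carries a predicted label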
Figure 3.5 shows the original partially labeled data, the two clusters, and the final labels
predicted for all the data. In this case, because the clusters coincide with the true labeling of the data,
we correctly classify all unlabeled instances. The lines between instances in the center panel indicate
the linkages responsible for merging clusters until only two remain.
It turns out that using single linkage is critically important here, where the natural clusters
are long and skinny. A different choice—e.g., complete linkage clustering—tends to form rounder
clusters. When applied to this data, complete linkage produces the results shown in Figure 3.6. The
clusters found by complete linkage do not match up with the true labeling of the data. In fact, both
labeled instances end up in the same cluster. Because there is no majority label in either cluster,
the majority vote algorithm ends up breaking ties randomly, thus assigning random labels to all
unlabeled instances.
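In the sketch above, this comparison amounts to changing a single line:

    # complete linkage tends to form rounder clusters on the same data
    Z = linkage(X, method="complete", metric="euclidean")

Whether this reproduces the failure on the hypothetical stand-in data is not guaranteed; in the book's example, the labeled instances end up together, so the majority vote within each cluster is a tie (or empty) and must be broken arbitrarily.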
The point of this example is not to show that complete linkage is bad and single linkage is good. In fact, it could have been the other way around for other datasets! Instead, the example is meant to show that the success of cluster-then-label hinges on whether the clustering algorithm's assumptions match the true structure of the data.