Geoscience Reference
In-Depth Information
2,
k = 4, or more. Without further assumptions, either one is acceptable. Unlike in supervised learning
(introduced in the next section), there is no teacher that tells us which instances should be in each
cluster.
There are many clustering algorithms. We introduce a particularly simple one, hierarchical
agglomerative clustering , to make unsupervised learning concrete.
How many clusters do you find in the little green men data in Figure 1.1? Perhaps k
=
Algorithm 1.6. Hierarchical Agglomerative Clustering.
i = 1 ; a distance function d().
1. Initially, place each instance in its own cluster (called a singleton cluster).
2. while (number of clusters > 1 ) do :
3. Find the closest cluster pair A, B, i.e., they minimize d(A,B).
4. Merge A, B to form a new cluster.
Output : a binary tree showing how clusters are gradually merged from singletons
to a root cluster, which contains the whole training sample.
Input : a training sample
{
x i }
This clustering algorithm is simple. The only thing unspecified is the distance function d() .
If x i , x j
are two singleton clusters, one way to define d( x i , x j ) is the Euclidean distance between
them:
D
d( x i , x j ) =
x i
x j =
(x is x js ) 2 .
(1.1)
s = 1
We also need to define the distance between two non-singleton clusters A, B . There are multiple
possibilities: one can define it to be the distance between the closest pair of points in A and B ,
the distance between the farthest pair, or some average distance. For simplicity, we will use the first
option, also known as single linkage :
A, x B d( x , x ).
d(A,B) =
min
(1.2)
x
It is not necessary to fully grow the tree until only one cluster remains: the clustering algorithm
can be stopped at any point if d() exceeds some threshold, or the number of clusters reaches a
predetermined number k .
Figure 1.2 illustrates the results of hierarchical agglomerative clustering for k
2 , 3 , 4,re-
spectively. The clusters certainly look fine. But because there is no information on how each instance
should be clustered, it can be difficult to objectively evaluate the result of clustering algorithms.
=
1.3
SUPERVISEDLEARNING
Suppose you realize that your alien hosts have a gender: female or male (so they should not all be
called little green men after all). You may now be interested in predicting the gender of a particular
Search WWH ::




Custom Search