SEPARATION OF THE CLUSTERS
Analysts also hope for well-separated (well-spaced) clusters:
• A good way of quantifying the cluster separation is to construct a proximity
matrix with the distances between the cluster centroids. The minimum distance
between clusters should be identified and assessed since this distance may
indicate similar clusters that may be merged.
• Analysts may also examine a separation measure named sum of squares between
(SSB), which is based on the (squared Euclidean) distances of each cluster's
centroid to the overall centroid of the whole population. In order to compare
models we can use the average SSB calculated as follows:
$$ \text{Average SSB} = \frac{1}{N} \sum_{i \in C} N_i \cdot \mathrm{dist}(c_i, c)^2 $$
where c_i is the centroid of cluster i, c the overall centroid, N the total number
of cases, N_i the number of cases in cluster i, and C the set of clusters. The SSB
is directly related to the pairwise distances between the centroids: the higher
the SSB, the more separated the derived clusters.
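The two separation checks above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes each cluster is given as a list of numeric tuples keyed by a label, and the helper names (centroid, centroid_distances, average_ssb) are hypothetical names chosen for this example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def centroid_distances(clusters):
    """Proximity matrix stored as {(label_a, label_b): distance between centroids}."""
    centroids = {label: centroid(pts) for label, pts in clusters.items()}
    return {
        (a, b): dist(centroids[a], centroids[b])
        for a, b in combinations(centroids, 2)
    }


def average_ssb(clusters):
    """Average SSB: size-weighted squared distance of each centroid to the overall centroid, divided by N."""
    all_points = [p for pts in clusters.values() for p in pts]
    overall = centroid(all_points)
    weighted = sum(
        len(pts) * dist(centroid(pts), overall) ** 2
        for pts in clusters.values()
    )
    return weighted / len(all_points)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    clusters = {
        "A": [(0.0, 0.0), (0.0, 2.0)],
        "B": [(4.0, 0.0), (4.0, 2.0)],
        "C": [(10.0, 0.0), (10.0, 2.0)],
    }
    distances = centroid_distances(clusters)
    closest = min(distances, key=distances.get)
    print("closest pair of clusters:", closest, distances[closest])
    print("average SSB:", average_ssb(clusters))
```

The pair with the smallest centroid distance is the first candidate to inspect for a possible merge, and the average SSB values of competing models can be compared directly when they are computed on the same data.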
A combined measure that assesses both the internal cohesion and the external
separation of a clustering solution is the silhouette coefficient, which is calculated
as follows:
1. For each record i in a cluster we calculate a(i) as the average (Euclidean)
distance to all other records in the same cluster. This value indicates how well
a specific record fits its cluster. To simplify the computation, a(i) may instead
be taken as the (Euclidean) distance of the record from its cluster centroid.
2. For each record i and for each cluster that does not contain i, we calculate
the average (Euclidean) distance of the record to all the members of that
cluster. After doing this for every cluster of which i is not a member, we take
b(i) as the minimum of these average distances. Once again, to ease computation,
b(i) may instead be taken as the minimum distance between the record and the
centroids of the other clusters.
3. The silhouette coefficient for the record i is defined as:
$$ S_i = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
The silhouette coefficient varies between -1 and 1. Analysts hope for positive
coefficient values, ideally close to 1, as this would indicate a(i) values close to 0
and perfect internal homogeneity.
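As a rough illustration of the three steps, the sketch below computes the silhouette coefficient of every record using the full average-distance definitions rather than the centroid shortcuts. It assumes clusters are given as lists of numeric tuples keyed by label and that there are at least two clusters. The function names are hypothetical, and scoring a record in a single-member cluster as 0 is a convention adopted here, not something stated in the text.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean_distance(point, others):
    """Average Euclidean distance from point to every record in others."""
    return sum(dist(point, other) for other in others) / len(others)


def silhouette(clusters):
    """Return {label: [S_i for each record of that cluster, in input order]}."""
    scores = {}
    for label, members in clusters.items():
        scores[label] = []
        for idx, point in enumerate(members):
            same = members[:idx] + members[idx + 1:]
            if not same:
                # Assumption: a record alone in its cluster gets a score of 0.
                scores[label].append(0.0)
                continue
            # Step 1: a(i), average distance to the other records of its own cluster.
            a = mean_distance(point, same)
            # Step 2: b(i), smallest average distance to any other cluster.
            b = min(
                mean_distance(point, other_members)
                for other_label, other_members in clusters.items()
                if other_label != label
            )
            # Step 3: S_i = [b(i) - a(i)] / max{a(i), b(i)}.
            denom = max(a, b)
            scores[label].append((b - a) / denom if denom > 0 else 0.0)
    return scores


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    clusters = {
        "A": [(0.0, 0.0), (0.0, 2.0)],
        "B": [(4.0, 0.0), (4.0, 2.0)],
    }
    for label, values in silhouette(clusters).items():
        print(label, [round(v, 3) for v in values])
```

Averaging the per-record values gives a single figure for a cluster or for the whole solution. The pairwise distances make this quadratic in the number of records, which is why the centroid-based simplifications mentioned in steps 1 and 2 are attractive on large datasets.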