$$\mathrm{SSE} = \sum_{x \in C} \mathrm{dist}(c, x)^2$$
where x is an observation in cluster C and c is the cluster centroid. If all
observations are tightly packed around the centroid, the SSE is relatively low.
When observations are spread, the SSE is greater.
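As a minimal sketch of the SSE computation, the following assumes that dist is the Euclidean distance and that the centroid is the coordinate-wise mean of the cluster; the text does not fix either choice. The function names and the two toy clusters are hypothetical.

```python
def centroid_of(cluster):
    """Coordinate-wise mean of the observations (assumed centroid definition)."""
    m = len(cluster)
    return tuple(sum(coords) / m for coords in zip(*cluster))

def squared_dist(a, b):
    """Squared Euclidean distance (assumed choice of dist)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def sse(cluster):
    """Sum of squared distances from each observation to the cluster centroid."""
    c = centroid_of(cluster)
    return sum(squared_dist(c, x) for x in cluster)

# Two invented clusters with the same number of observations.
tight = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
spread = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]

sse_tight = sse(tight)    # 2.0
sse_spread = sse(spread)  # 32.0
print(sse_tight, sse_spread)
```

The tightly packed cluster yields the lower SSE, as described above.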
Because the clusters in a clustering vary in size (number of observations),
SSE will generally be larger for clusters containing more observations. To
compare cohesiveness between clusters directly, we compute the mean squared
error (MSE) of a cluster as:
$$\mathrm{MSE} = \frac{\mathrm{SSE}}{m}$$
where m is the number of observations belonging to the cluster. A special case
to be aware of is the single-observation cluster, where the observation is its
own centroid, so SSE and MSE will always be zero.
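The effect of cluster size can be illustrated with a short sketch. It assumes squared Euclidean distance and a mean-based centroid, neither of which the text prescribes; the helper name and the one-dimensional toy data are hypothetical.

```python
def sse_and_mse(cluster):
    """Return (SSE, MSE) for a cluster, assuming squared Euclidean distance."""
    m = len(cluster)
    # Centroid taken as the coordinate-wise mean (an assumption).
    c = tuple(sum(coords) / m for coords in zip(*cluster))
    sse = sum(sum((ci - xi) ** 2 for ci, xi in zip(c, x)) for x in cluster)
    return sse, sse / m

small = [(0.0,), (2.0,)]
large = [(0.0,), (2.0,), (0.0,), (2.0,)]  # same spread, twice the observations
single = [(5.0,)]                         # single-observation cluster

small_sse, small_mse = sse_and_mse(small)     # (2.0, 1.0)
large_sse, large_mse = sse_and_mse(large)     # (4.0, 1.0)
single_sse, single_mse = sse_and_mse(single)  # (0.0, 0.0)
print(small_sse, small_mse, large_sse, large_mse, single_sse, single_mse)
```

The larger cluster doubles the SSE although its observations are no more spread out, while the MSE is the same for both.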
To compare clusterings (that is, clusterings generated by different
proximity-based clustering algorithms or by multiple executions of the same
algorithm) with respect to overall cluster cohesion, we compute the total sum
of squared error (TSSE) as:
$$\mathrm{TSSE} = \sum_{c \in CL} \mathrm{SSE}_c$$
where c is a cluster within the full clustering CL (here c denotes a cluster,
not a centroid). Take care when comparing clusterings with significantly
different cluster counts: the greater the number of clusters in a clustering,
the lower the TSSE tends to be. At the extreme, a clustering of one
observation per cluster has a TSSE of zero, and one would certainly not expect
this to be a useful clustering.
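The dependence of TSSE on cluster count can be seen in a small sketch, again assuming squared Euclidean distance and mean-based centroids. The function names and the four one-dimensional observations are invented for illustration.

```python
def sse(cluster):
    """SSE of one cluster, assuming squared Euclidean distance to the mean."""
    m = len(cluster)
    c = tuple(sum(coords) / m for coords in zip(*cluster))
    return sum(sum((ci - xi) ** 2 for ci, xi in zip(c, x)) for x in cluster)

def tsse(clustering):
    """Total SSE: the sum of the SSE of every cluster in the clustering."""
    return sum(sse(cluster) for cluster in clustering)

# The same four observations partitioned into 1, 2, and 4 clusters.
one_cluster = [[(0.0,), (1.0,), (10.0,), (11.0,)]]
two_clusters = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
singletons = [[(0.0,)], [(1.0,)], [(10.0,)], [(11.0,)]]

tsse_one = tsse(one_cluster)    # 101.0
tsse_two = tsse(two_clusters)   # 1.0
tsse_four = tsse(singletons)    # 0.0
print(tsse_one, tsse_two, tsse_four)
```

The singleton clustering reaches a TSSE of zero even though it conveys no grouping at all, which is why TSSE alone is a poor guide when cluster counts differ widely.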
An overall measure of cluster separation is the total “between group” sum of
squares (TSSB). TSSB is the sum of the squared distances of the cluster
centroids from the overall dataset mean (the dataset centroid), each weighted
by the number of observations in the cluster. It is computed as:
$$\mathrm{TSSB} = \sum_{i=1}^{K} m_i \, \mathrm{dist}(c_i, c)^2$$
where $m_i$ is the number of observations in cluster i, K is the total number
of clusters, $c_i$ is the centroid of cluster i, and c is the overall dataset
centroid. When comparing clusterings, the greater the TSSB of the clustering,
the better the separation.
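A minimal sketch of the TSSB computation follows, assuming squared Euclidean distance and taking both the cluster centroids and the dataset centroid as coordinate-wise means. The function names and the two example partitions are hypothetical.

```python
def mean_point(points):
    """Coordinate-wise mean of a list of points (assumed centroid definition)."""
    m = len(points)
    return tuple(sum(coords) / m for coords in zip(*points))

def tssb(clustering):
    """Size-weighted squared distance of each cluster centroid from the dataset centroid."""
    all_points = [x for cluster in clustering for x in cluster]
    c = mean_point(all_points)  # overall dataset centroid
    total = 0.0
    for cluster in clustering:
        c_i = mean_point(cluster)  # centroid of cluster i
        m_i = len(cluster)         # number of observations in cluster i
        total += m_i * sum((a - b) ** 2 for a, b in zip(c_i, c))
    return total

# Two partitions of the same four observations into K = 2 clusters.
well_separated = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
poorly_separated = [[(0.0,), (10.0,)], [(1.0,), (11.0,)]]

tssb_good = tssb(well_separated)    # 100.0
tssb_poor = tssb(poorly_separated)  # 1.0
print(tssb_good, tssb_poor)
```

With the cluster count held fixed, the partition that keeps nearby observations together produces the much larger TSSB.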