Our null hypothesis is the homogeneous hypothesis —that D is uniformly distributed
and thus contains no meaningful clusters. The nonhomogeneous hypothesis (i.e., that D
is not uniformly distributed and thus contains clusters) is the alternative hypothesis.
We can conduct the Hopkins Statistic test iteratively, using 0.5 as the threshold to reject
the alternative hypothesis. That is, if H > 0.5, then it is unlikely that D has
statistically significant clusters.
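As a concrete illustration, here is a minimal Python sketch of the Hopkins Statistic
test (H as defined in Section 10.6.1). The function name hopkins_statistic, the sample
size m, and the use of the bounding box of D as the sampling space are our assumptions,
not prescribed by the text:

import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(D, m=None, seed=None):
    # Hopkins statistic H = sum(y) / (sum(x) + sum(y)).
    # H near 0.5 suggests D is uniformly distributed (no clusters);
    # H near 0 suggests D is highly skewed (clustered).
    rng = np.random.default_rng(seed)
    n, d = D.shape
    m = m or max(1, n // 10)  # number of sampled points (our choice)
    tree = cKDTree(D)

    # x_i: distance from a point drawn uniformly from the data space
    # (here, the bounding box of D) to its nearest neighbor in D.
    uniform = rng.uniform(D.min(axis=0), D.max(axis=0), size=(m, d))
    x, _ = tree.query(uniform, k=1)

    # y_i: distance from a sampled point of D to its nearest neighbor
    # in D excluding itself (k=2: the first hit is the point itself).
    sample = D[rng.choice(n, size=m, replace=False)]
    y, _ = tree.query(sample, k=2)

    return y[:, 1].sum() / (x.sum() + y[:, 1].sum())

Repeating the test with fresh samples and checking whether H stays above 0.5
implements the iterative use of the test described above.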
10.6.2 Determining the Number of Clusters
Determining the “right” number of clusters in a data set is important, not only because
some clustering algorithms like k-means require such a parameter, but also because the
appropriate number of clusters controls the proper granularity of cluster analysis. It can
be regarded as finding a good balance between compressibility and accuracy in cluster
analysis. Consider two extreme cases. What if you were to treat the entire data set as a
cluster? This would maximize the compression of the data, but such a cluster analysis
has no value. On the other hand, treating each object in a data set as a cluster gives
the finest clustering resolution (i.e., most accurate due to the zero distance between an
object and the corresponding cluster center). In some methods like k-means, this even
achieves the best cost. However, having one object per cluster does not enable any data
summarization.
Determining the number of clusters is far from easy, often because the “right”
number is ambiguous. Figuring out what the right number of clusters should be often
depends on the shape and scale of the distribution in the data set, as well as the
clustering resolution required by the user. There are many possible ways to estimate the
number of clusters. Here, we briefly introduce a few simple yet popular and effective
methods.
A simple method is to set the number of clusters to about √(n/2) for a data set of
n points. In expectation, each cluster then has √(2n) points.
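For example, in a data set of n = 200 points, this rule suggests about √(200/2) = 10
clusters, each containing √(2 × 200) = 20 points in expectation.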
The elbow method is based on the observation that increasing the number of clusters
can help to reduce the sum of within-cluster variances. This is because
having more clusters allows one to capture finer groups of data objects that are more
similar to each other. However, the marginal effect of reducing the sum of within-cluster
variances may drop if too many clusters are formed, because splitting a cohesive cluster
into two gives only a small reduction. Consequently, a heuristic for selecting the right
number of clusters is to use the turning point in the curve of the sum of within-cluster
variances with respect to the number of clusters.
Technically, given a number k > 0, we can form k clusters on the data set in question
using a clustering algorithm like k-means, and calculate the sum of within-cluster
variances, var(k). We can then plot the curve of var(k) with respect to k. The first
(or most significant) turning point of the curve suggests the “right” number.
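The sketch below illustrates this procedure in Python. It is a minimal example, not
the book's own code: it assumes scikit-learn's KMeans, whose inertia_ attribute (the
within-cluster sum of squared errors) stands in for var(k), and the function name
plot_elbow and the tested range of k values are arbitrary choices:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

def plot_elbow(D, k_max=10, seed=0):
    # For each k, cluster D with k-means and record the within-cluster
    # sum of squared errors (KMeans.inertia_), used here as var(k).
    ks = range(1, k_max + 1)
    var = [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(D).inertia_
           for k in ks]
    # Plot var(k) against k; the "elbow" (first major turning point)
    # suggests the number of clusters.
    plt.plot(ks, var, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("sum of within-cluster variances var(k)")
    plt.show()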
More advanced methods can determine the number of clusters using information
criteria or information theoretic approaches. Please refer to the bibliographic notes for
further information (Section 10.9).