Our null hypothesis is the homogeneous hypothesis —that D is uniformly distributed
and thus contains no meaningful clusters. The nonhomogeneous hypothesis (i.e., that D
is not uniformly distributed and thus contains clusters) is the alternative hypothesis.
We can conduct the Hopkins Statistic test iteratively, using 0.5 as the threshold to reject
the alternative hypothesis. That is, if H > 0.5, then it is unlikely that D has
statistically significant clusters.
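As a concrete illustration, here is a minimal Python sketch of the Hopkins Statistic
test (H as defined in Section 10.6.1). The function name hopkins_statistic, the sample
size m, and the use of the bounding box of D as the sampling space are our assumptions,
not prescribed by the text:

import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(D, m=None, seed=None):
    # Hopkins statistic H = sum(y) / (sum(x) + sum(y)).
    # H near 0.5 suggests D is uniformly distributed (no clusters);
    # H near 0 suggests D is highly skewed (clustered).
    rng = np.random.default_rng(seed)
    n, d = D.shape
    m = m or max(1, n // 10)  # number of sampled points (our choice)
    tree = cKDTree(D)

    # x_i: distance from a point drawn uniformly from the data space
    # (here, the bounding box of D) to its nearest neighbor in D.
    uniform = rng.uniform(D.min(axis=0), D.max(axis=0), size=(m, d))
    x, _ = tree.query(uniform, k=1)

    # y_i: distance from a sampled point of D to its nearest neighbor
    # in D excluding itself (k=2: the first hit is the point itself).
    sample = D[rng.choice(n, size=m, replace=False)]
    y, _ = tree.query(sample, k=2)

    return y[:, 1].sum() / (x.sum() + y[:, 1].sum())

Repeating the test with fresh samples and checking whether H stays above 0.5
implements the iterative use of the test described above.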
10.6.2 Determining the Number of Clusters
Determining the “right” number of clusters in a data set is important, not only because
some clustering algorithms like k-means require such a parameter, but also because the
appropriate number of clusters controls the proper granularity of cluster analysis. It can
be regarded as finding a good balance between compressibility and accuracy in cluster
analysis. Consider two extreme cases. What if you were to treat the entire data set as a
cluster? This would maximize the compression of the data, but such a cluster analysis
has no value. On the other hand, treating each object in a data set as a cluster gives
the finest clustering resolution (i.e., most accurate due to the zero distance between an
object and the corresponding cluster center). In some methods like k-means, this even
achieves the best cost. However, having one object per cluster does not enable any data
summarization.
Determining the number of clusters is far from easy, often because the “right”
number is ambiguous. Figuring out what the right number of clusters should be often
depends on the shape and scale of the distribution in the data set, as well as the
clustering resolution required by the user. There are many possible ways to estimate the
number of clusters. Here, we briefly introduce a few simple yet popular and effective
methods.
A simple method is to set the number of clusters to about √(n/2) for a data set of
n points. In expectation, each cluster then has √(2n) points.
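For example, in a data set of n = 200 points, this rule suggests about √(200/2) = 10
clusters, each containing √(2 × 200) = 20 points in expectation.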
The elbow method is based on the observation that increasing the number of clusters
can help to reduce the sum of within-cluster variances. This is because
having more clusters allows one to capture finer groups of data objects that are more
similar to each other. However, the marginal effect of reducing the sum of within-cluster
variances may drop if too many clusters are formed, because splitting a cohesive cluster
into two gives only a small reduction. Consequently, a heuristic for selecting the right
number of clusters is to use the turning point in the curve of the sum of within-cluster
variances with respect to the number of clusters.
Technically, given a number k > 0, we can form k clusters on the data set in question
using a clustering algorithm like k-means, and calculate the sum of within-cluster
variances, var(k). We can then plot the curve of var(k) with respect to k. The first
(or most significant) turning point of the curve suggests the “right” number.
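The sketch below illustrates this procedure in Python. It is a minimal example, not
the book's own code: it assumes scikit-learn's KMeans, whose inertia_ attribute (the
within-cluster sum of squared errors) stands in for var(k), and the function name
plot_elbow and the tested range of k values are arbitrary choices:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

def plot_elbow(D, k_max=10, seed=0):
    # For each k, cluster D with k-means and record the within-cluster
    # sum of squared errors (KMeans.inertia_), used here as var(k).
    ks = range(1, k_max + 1)
    var = [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(D).inertia_
           for k in ks]
    # Plot var(k) against k; the "elbow" (first major turning point)
    # suggests the number of clusters.
    plt.plot(ks, var, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("sum of within-cluster variances var(k)")
    plt.show()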
More advanced methods can determine the number of clusters using information
criteria or information theoretic approaches. Please refer to the bibliographic notes for
further information (Section 10.9).