Cluster Analysis: Basic Concepts and Methods - Data Mining: Concepts and Techniques

the feasibility of clustering analysis on a data set and the quality of the results generated

by a clustering method. The major tasks of clustering evaluation include the following:

Assessing clustering tendency. In this task, for a given data set, we assess whether a

nonrandom structure exists in the data. Blindly applying a clustering method on a

data set will return clusters; however, the clusters mined may be misleading. Clustering

analysis on a data set is meaningful only when there is a nonrandom structure in

the data.

Determining the number of clusters in a data set. A few algorithms, such as k-means,

require the number of clusters in a data set as the parameter. Moreover, the number

of clusters can be regarded as an interesting and important summary statistic of a

data set. Therefore, it is desirable to estimate this number even before a clustering

algorithm is used to derive detailed clusters.

Measuring clustering quality. After applying a clustering method on a data set, we

want to assess how good the resulting clusters are. A number of measures can be used.

Some methods measure how well the clusters fit the data set, while others measure

how well the clusters match the ground truth, if such truth is available. There are also

measures that score clusterings and thus can compare two sets of clustering results

on the same data set.

In the rest of this section, we discuss each of these three topics.

10.6.1 Assessing Clustering Tendency

Clustering tendency assessment determines whether a given data set has a nonrandom

structure, which may lead to meaningful clusters. Consider a data set that does not have

any nonrandom structure, such as a set of uniformly distributed points in a data space.

Even though a clustering algorithm may return clusters for the data, those clusters are

random and are not meaningful.

Example 10.9 Clustering requires nonuniform distribution of data. Figure 10.21 shows a data set

that is uniformly distributed in 2-D data space. Although a clustering algorithm may

still artificially partition the points into groups, the groups are unlikely to mean anything

significant to the application due to the uniform distribution of the data.
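
The point of this example can be reproduced with a small experiment. The following is a minimal sketch, not the book's own code: it assumes NumPy is available, draws 500 points uniformly from the unit square (an arbitrary choice), and runs a plain k-means (Lloyd's algorithm) with k = 3. The helper name kmeans, the seeds, and the value of k are all illustrative assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (illustrative helper); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Uniformly distributed 2-D data: no nonrandom structure by construction.
rng = np.random.default_rng(42)
uniform_points = rng.uniform(0.0, 1.0, size=(500, 2))

labels, centroids = kmeans(uniform_points, k=3)
cluster_sizes = np.bincount(labels, minlength=3)
print("cluster sizes:", cluster_sizes)
print("centroids:")
print(centroids.round(2))
```

The algorithm reports a partition of all 500 points regardless of the fact that the data were generated uniformly. The resulting groups reflect only how k-means happens to carve up the square, which is why a clustering tendency check is useful before interpreting such output.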

“How can we assess the clustering tendency of a data set?” Intuitively, we can try to

measure the probability that the data set is generated by a uniform data distribution.

This can be achieved using statistical tests for spatial randomness. To illustrate this idea,

let's look at a simple yet effective statistic called the Hopkins Statistic.

The Hopkins Statistic is a spatial statistic that tests the spatial randomness of a variable

as distributed in a space. Given a data set, D, which is regarded as a sample of
