Databases Reference
In-Depth Information
the feasibility of clustering analysis on a data set and the quality of the results generated
by a clustering method. The major tasks of clustering evaluation include the following:
Assessing clustering tendency . In this task, for a given data set, we assess whether a
nonrandom structure exists in the data. Blindly applying a clustering method on a
data set will return clusters; however, the clusters mined may be misleading. Cluster-
ing analysis on a data set is meaningful only when there is a nonrandom structure in
the data.
Determining the number of clusters in a data set . A few algorithms, such as k -means,
require the number of clusters in a data set as the parameter. Moreover, the number
of clusters can be regarded as an interesting and important summary statistic of a
data set. Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.
Measuring clustering quality . After applying a clustering method on a data set, we
want to assess how good the resulting clusters are. A number of measures can be used.
Some methods measure how well the clusters fit the data set, while others measure
how well the clusters match the ground truth, if such truth is available. There are also
measures that score clusterings and thus can compare two sets of clustering results
on the same data set.
In the rest of this section, we discuss each of these three topics.
10.6.1 AssessingClusteringTendency
Clustering tendency assessment determines whether a given data set has a non-random
structure, which may lead to meaningful clusters. Consider a data set that does not have
any non-random structure, such as a set of uniformly distributed points in a data space.
Even though a clustering algorithm may return clusters for the data, those clusters are
random and are not meaningful.
Example10.9 Clustering requires nonuniform distribution of data. Figure 10.21 shows a data set
that is uniformly distributed in 2-D data space. Although a clustering algorithm may
still artificially partition the points into groups, the groups will unlikely mean anything
significant to the application due to the uniform distribution of the data.
“How can we assess the clustering tendency of a data set?” Intuitively, we can try to
measure the probability that the data set is generated by a uniform data distribution.
This can be achieved using statistical tests for spatial randomness. To illustrate this idea,
let's look at a simple yet effective statistic called the Hopkins Statistic.
The Hopkins Statistic is a spatial statistic that tests the spatial randomness of a vari-
able as distributed in a space. Given a data set, D , which is regarded as a sample of
 
Search WWH ::




Custom Search