from a large dataset to support data reduction, identifying natural
clusters in a dataset to give insight into what cases are grouped
together, as well as finding data that either does not belong to any of
the found clusters or belongs to a cluster of only a few cases, which
provides a kind of outlier or anomaly detection [DEI 2005]. The data
format used for clustering is similar to that used for supervised
learning, except that no target attribute is specified.
Essentially, clustering analysis identifies clusters that exist in a given
dataset, where a cluster is a collection of cases that are more similar to
one another than to cases in other clusters. A set of clusters is considered
to be of high quality if the similarity between clusters is low, yet the
similarity of cases within a cluster is high [Anderberg 1973]. As
illustrated in Figure 4-8, it can be fairly easy to understand clusters in
two dimensions. Here we have two numerical attributes: income and age.
The figure depicts two clusters, each with its respective centroid, that
is, its representative center point. Cluster C1 corresponds to individuals
with lower income and lower age, whereas Cluster C2 corresponds to
individuals with higher income and higher age. If we look at the
histograms of these attributes as illustrated in Figure 4-9, we see that
the number of cases is highest near the centroid of each cluster.
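
The ideas of a centroid and of cluster quality can be made concrete with a small sketch. The cases below are invented for illustration and are not the data behind Figures 4-8 and 4-9, and the names centroid and mean_distance_to_centroid are hypothetical helpers. The sketch also assumes that Euclidean distance stands in for dissimilarity, so a small distance within a cluster and a large distance between centroids suggest a good set of clusters.

```python
from math import dist
from statistics import mean

# Hypothetical cases as (income in thousands, age in years); values are invented.
clusters = {
    "C1": [(25, 22), (30, 25), (28, 27), (35, 30), (32, 24)],  # lower income, lower age
    "C2": [(80, 50), (90, 55), (85, 60), (95, 52), (88, 58)],  # higher income, higher age
}

def centroid(cases):
    """Attribute-wise mean of the cases: the cluster's representative center point."""
    return tuple(mean(values) for values in zip(*cases))

def mean_distance_to_centroid(cases):
    """Average Euclidean distance from each case to its cluster's centroid."""
    center = centroid(cases)
    return mean(dist(case, center) for case in cases)

centroids = {name: centroid(cases) for name, cases in clusters.items()}
within = {name: mean_distance_to_centroid(cases) for name, cases in clusters.items()}
between = dist(centroids["C1"], centroids["C2"])

for name in clusters:
    print(name, "centroid:", centroids[name],
          "mean distance to centroid:", round(within[name], 2))
print("distance between centroids:", round(between, 2))
```

In this toy data the cases sit much closer to their own centroid than the two centroids sit to each other, which is the pattern described above. Because income and age are on different scales, a real analysis would normally rescale the attributes before computing distances; that step is omitted here for brevity.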
With this simple example, there is no need to use a data mining
algorithm to identify the clusters; visual inspection reveals them
once the data is graphed. With advanced visualization