CLUSTER ANALYSIS (Social Science)

Quantitative social science often involves measurements of several variables for a number of cases (individuals or subjects). Searching for groupings, or clusters, is an important exploratory technique. Grouping can provide a means for summarizing data, identifying outliers, or suggesting questions to study.

A well-known clustering is that of stars into a main sequence, white dwarfs, and red giants, according to temperature and luminosity. The military has used cluster analysis of anthropometric data to reduce the number of different uniform sizes kept in inventory. Cluster analysis in marketing is called market segmentation; consumers are clustered according to psychographic, demographic, and purchasing behavior variables. The United States has been divided into a number of clusters according to lifestyle and buying habits.

Establishing the profile of a case, an observational unit, is the first step in cluster analysis. The profile of a case is its pattern of scores across a set of correlated variables. Cases with similar profiles should be in the same cluster; cases with disparate profiles, in different clusters. The mean profile of a cluster is its centroid: the set of means of the variables for the individuals in that cluster. Cluster profiles provide a good summary of the data, and examining them provides insight into what the clusters mean. A cluster's profile can suggest an interpretation and a name for it.
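
As a minimal illustration (the data and cluster labels below are hypothetical), a centroid is simply the vector of variable means over the cases assigned to a cluster:

    import numpy as np

    # Hypothetical scores of six cases on three variables (one row per case).
    X = np.array([[2.0, 5.0, 1.0],
                  [2.2, 4.8, 1.1],
                  [7.0, 1.0, 6.0],
                  [6.8, 1.2, 5.9],
                  [2.1, 5.1, 0.9],
                  [7.2, 0.9, 6.1]])

    # Hypothetical cluster membership of the six cases.
    labels = np.array([0, 0, 1, 1, 0, 1])

    # The centroid (mean profile) of each cluster is the set of variable
    # means for the cases in that cluster.
    for k in np.unique(labels):
        print("cluster", k, "centroid:", X[labels == k].mean(axis=0))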

There are two broad types of clustering algorithms: hierarchical clustering and nonhierarchical clustering (partitioning). Hierarchical clustering follows one of two approaches. Agglomerative clustering starts with each case as its own cluster and, at each step, merges clusters until only one (or a few) large clusters remain. Divisive clustering begins with one large cluster and splits it into successively smaller clusters.
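
A minimal sketch of the agglomerative approach, using scikit-learn's AgglomerativeClustering on made-up two-dimensional data (the data and parameter values are illustrative only):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Made-up data: five cases measured on two variables.
    X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

    # Each case starts as its own cluster; the closest clusters are merged
    # step by step, and merging stops when two clusters remain.
    model = AgglomerativeClustering(n_clusters=2, linkage="average")
    print(model.fit_predict(X))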


There are several ways to define intercluster distance. This can be done by forming all pairs of objects, with one object in one cluster and one in the other, and computing the distances between the members of these pairs. Single linkage is based on the shortest of these; complete linkage on the longest; and average linkage on their mean. Joe Ward’s method (1963) is based on the sum of squares between the two clusters, summed over all variables. The centroid method is based on the distance between cluster centroids.
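
These intercluster distances can be compared directly; a sketch using SciPy's hierarchical clustering routines on the same illustrative data (the method names follow SciPy's conventions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

    # Build the merge tree under each definition of intercluster distance,
    # then cut it into two clusters.
    for method in ["single", "complete", "average", "ward", "centroid"]:
        Z = linkage(X, method=method)
        print(method, fcluster(Z, t=2, criterion="maxclust"))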

Nonhierarchical clustering partitions the sample. The K-means algorithm assigns each case to the cluster with the nearest centroid. The process begins by partitioning the cases into K initial clusters; each case is then moved, in turn, to the cluster whose centroid is nearest. The centroids of the cluster receiving the case and the cluster losing it are updated after each move, and this is repeated until no more reassignments take place. The ISODATA algorithm is similar to K-means, except that all cases are reassigned before the centroids are updated. An alternative to starting with an initial clustering is to start with an initialization of the centroids, for example as the first K cases in the dataset or as K cases chosen at random from it.
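
A from-scratch sketch of the centroid-update idea; this version, like the ISODATA variant described above, reassigns all cases before recomputing the centroids, and initializes the centroids as K randomly chosen cases (the function name and simulated data are illustrative):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize the centroids as K cases chosen at random from the data.
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            # Assign every case to the cluster with the nearest centroid.
            dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            # Recompute each centroid as the mean of its assigned cases.
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centroids, centroids):
                break              # assignments are stable; stop iterating
            centroids = new_centroids
        return labels, centroids

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
    labels, centroids = kmeans(X, K=2)
    print(centroids)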

The notion of nearest requires a notion of distance. Often, rightly or wrongly, researchers use Euclidean distance, the straight-line distance between two points (the square root of the sum of squared differences across the variables). Euclidean distance is appropriate for variables that are uncorrelated and have equal variances. Standardization of the data is needed if the range or scale of one variable is much larger than that of the others. Mahalanobis distance (statistical distance), which adjusts for unequal variances and for the correlations among the variables, is preferred.
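
A sketch contrasting the three notions of distance on simulated data; the mixing matrix below is arbitrary and serves only to give the variables unequal variances and correlations:

    import numpy as np
    from scipy.spatial.distance import euclidean, mahalanobis

    rng = np.random.default_rng(0)
    # Simulated data with correlated variables of unequal variance.
    X = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 2.0]])
    x, y = X[0], X[1]

    print(euclidean(x, y))              # raw Euclidean distance

    # Standardize so that no variable dominates merely because of its scale.
    s = X.std(axis=0)
    print(euclidean(x / s, y / s))

    # Mahalanobis (statistical) distance adjusts for variances and
    # correlations via the inverse covariance matrix of the variables.
    VI = np.linalg.inv(np.cov(X, rowvar=False))
    print(mahalanobis(x, y, VI))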

It is sometimes suggested that researchers start with hierarchical clustering to generate initial centroids and then apply nonhierarchical clustering. A conceptual model for clustering is that the sample comes from a mixture of several populations; this leads to a mathematical probability model called the finite mixture model. If the form of the within-cluster distribution is specified (such as multivariate normal), then the method of maximum likelihood can be used to estimate the parameters. This is done with an iterative algorithm.
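
A sketch of this model-based approach under a two-component multivariate normal mixture, fit by maximum likelihood with scikit-learn's GaussianMixture (its iterative fitting procedure is the EM algorithm); the simulated sample is illustrative:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Simulated sample from a mixture of two bivariate normal populations.
    X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(4, 1.5, (100, 2))])

    # Maximum likelihood estimation of the two-component normal mixture.
    gm = GaussianMixture(n_components=2, covariance_type="full",
                         random_state=0).fit(X)
    print(gm.means_)          # estimated component means (cluster centroids)
    print(gm.predict(X)[:5])  # most probable component for the first cases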

There are several procedures for determining the number of clusters; this task should be guided by substantive theory and the practicality of the results. A criterion such as the between-groups sum of squares or the likelihood can be plotted against the number of clusters in a scree plot. When a normal mixture model is used, model selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) can be used.
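
A sketch of choosing the number of clusters under a normal mixture model: fit mixtures with an increasing number of components and compare AIC and BIC (again on simulated data):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

    # A clear minimum (or an elbow in a scree-style plot of these values)
    # suggests the number of clusters.
    for k in range(1, 7):
        gm = GaussianMixture(n_components=k, random_state=0).fit(X)
        print(k, round(gm.aic(X), 1), round(gm.bic(X), 1))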

Once the clusters are formed, researchers can use discriminant analysis to determine which variables account for the clustering and to classify new cases into the clusters. Some cluster techniques operate on distances or similarities rather than raw data. Variables can be clustered using their correlations as similarities. Simultaneous clustering of cases and variables is called block clustering. If a subset of the cases has similar values on a subset of the variables, these cases and variables form a block.
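
A sketch of that follow-up step: form clusters, then fit a linear discriminant function on the cluster labels so that new cases can be classified and the separating variables examined (the clustering method, data, and parameters are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])

    # Cluster first, then treat the cluster labels as known groups.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    lda = LinearDiscriminantAnalysis().fit(X, labels)

    print(lda.coef_)                              # variable weights separating the clusters
    print(lda.predict(rng.normal(4, 1, (3, 3))))  # classify three new cases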

James MacQueen's K-means algorithm (1967) was a milestone in the development of cluster analysis. John Wolfe (1970) was the first to program maximum likelihood clustering for the finite normal mixture model. John Hartigan's Clustering Algorithms (1975) did much to stimulate interest in cluster analysis. Geoff McLachlan and David Peel's Finite Mixture Models (2000) is a comprehensive presentation of model-based clustering.
