Chapter 5
Statistical Clustering Analysis: An Introduction
Hang Zhang
Department of Industrial Engineering
Arizona State University
Tempe, Arizona 85287-5906, USA
hang.zhang@asu.edu
Clustering analysis segments the objects in a dataset into meaningful subsets
such that objects with high similarity fall into the same subset and
objects with low similarity fall into different subsets. This chapter
introduces three fundamental topics in clustering analysis: the definition
of similarity and dissimilarity measures, the clustering algorithm, and the
determination of the number of clusters. For each topic, we introduce the
approaches that are most widely used and emphasize their statistical backgrounds.
5.1. Introduction
Clustering analysis groups the objects in a dataset into subsets such that objects
with high similarity are placed in the same subset and objects with low
similarity are placed in different subsets. The resulting subsets are
called clusters.
A dataset to be clustered consists of a collection of objects. An object may be
characterized by a vector of feature values. For example, in a dataset of fish, an
object is a single observation of a fish, represented by a vector of features such
as its weight, length, and color. We call the clustering of such objects observation
clustering. An object may also be characterized by a sequence of observations, e.g., the
time series of a stock price over one year. If we want to find a segmentation such
that stocks with high dependency are grouped into the same cluster and stocks
with low dependency into different clusters, we take each sequence as an object.
We call the clustering of such objects (sequences) variable clustering.
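To make the distinction concrete, the following minimal sketch represents the two kinds of objects described above and computes one possible pairwise measure for each. The data values and names (fish_a, stock_x, and so on) are hypothetical. Euclidean distance for feature vectors and Pearson correlation for sequences are assumptions chosen only for illustration, not the measures prescribed by this chapter.

```python
import math

# Observation clustering: each object is one feature vector.
# Hypothetical fish data: (weight in kg, length in cm).
fish = {
    "fish_a": (1.0, 30.0),
    "fish_b": (1.1, 31.0),
    "fish_c": (4.0, 70.0),
}

def euclidean(x, y):
    """Euclidean distance between two feature vectors (a dissimilarity)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Variable clustering: each object is a whole sequence of observations.
# Hypothetical price series for three stocks over five periods.
stocks = {
    "stock_x": [10.0, 11.0, 12.0, 11.5, 13.0],
    "stock_y": [20.0, 22.0, 24.0, 23.0, 26.0],  # moves together with stock_x
    "stock_z": [13.0, 11.5, 12.0, 11.0, 10.0],  # moves against stock_x
}

def pearson(u, v):
    """Pearson correlation between two equal-length sequences,
    used here as one assumed measure of dependency."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

# Small distance suggests two observations belong in the same cluster.
d_ab = euclidean(fish["fish_a"], fish["fish_b"])
d_ac = euclidean(fish["fish_a"], fish["fish_c"])

# High correlation suggests two sequences belong in the same cluster.
r_xy = pearson(stocks["stock_x"], stocks["stock_y"])
r_xz = pearson(stocks["stock_x"], stocks["stock_z"])
```

In this toy data, fish_a is much closer to fish_b than to fish_c, and stock_x is strongly positively correlated with stock_y but negatively correlated with stock_z. In practice the features would usually be rescaled before distances are computed, since weight and length are measured in different units.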
One question arises from the definition of clustering analysis: what is a
cluster? The answer to this question varies across different applications of clustering