6.5 Cluster analysis
Cluster analysis is concerned with investigating a set of data to discover
whether or not relatively distinct groups of observations can be identified
within the data set. The groups are NOT known a priori, and this
distinguishes cluster analysis from classification, the activity of
allocating individuals to one of a set of existing groups, as described
above. Uncovering the group structure (if any) of a set of data is clearly
of considerable importance in understanding the data and using them to
answer substantive science-based questions.
As in the classification problem, the first step in cluster analysis
is to define a suitable measure of distance. Since the classes are not
known a priori, we calculate the distance between pairs of observations
instead of calculating the distance between an observation and
the parameters of the known classes. Suppose we have N observations.
The calculation of the observation-to-observation distances leads us to
N(N - 1)/2 pairwise distances between N individuals. These are
dissimilarities between individuals and should not be confused with
inter-landmark distances within an individual. The dissimilarities
between all individuals can be collected in an N × N matrix, called the
matrix of dissimilarities, where each row and each column corresponds
to an individual. The dissimilarity calculated between individual a and
individual p is entered into the cell where column a and row p intersect
and again where column p and row a intersect. The matrix of
dissimilarities is a square symmetric matrix that is mathematically
similar to the Euclidean Distance Matrix (Form Matrix) that
was used in Chapters 3 to 5. Given such a matrix of dissimilarities, we
try to construct the groups so that observations within a group are
more similar to one another (have smaller dissimilarities) than they
are to observations in other groups. There are many standard statistical
procedures (e.g., hierarchical clustering or k-means clustering) that may
be used for this purpose. There are also several standard statistical
packages that implement these procedures. We do not discuss the details
of these procedures here. WinEDMA software (http://faith.med.jhmi.edu)
offers a procedure under the heading “ordination.”
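
The construction of the matrix of dissimilarities described above can be
sketched in a few lines of Python. This is only an illustration under stated
assumptions: the five two-dimensional observations are hypothetical, the
dissimilarity is taken to be ordinary Euclidean distance, and single-linkage
agglomeration with a cut-off (the `threshold` argument, a name introduced
here) stands in for whichever hierarchical procedure a statistical package
would provide. It is not the procedure implemented in WinEDMA.

```python
import math
from itertools import combinations


def dissimilarity_matrix(observations):
    """Return the N x N symmetric matrix of Euclidean dissimilarities."""
    n = len(observations)
    d = [[0.0] * n for _ in range(n)]
    for a, p in combinations(range(n), 2):  # the N(N - 1)/2 distinct pairs
        dist = math.sqrt(sum((x - y) ** 2
                             for x, y in zip(observations[a], observations[p])))
        d[a][p] = dist  # cell for individuals a and p
        d[p][a] = dist  # mirrored cell: the matrix is symmetric
    return d


def single_linkage(d, threshold):
    """Repeatedly merge the two closest groups until the smallest gap
    between any two groups exceeds `threshold` (an assumed cut-off)."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # single linkage: gap = smallest dissimilarity across the two groups
            gap = min(d[a][p] for a in clusters[i] for p in clusters[j])
            if best is None or gap < best[0]:
                best = (gap, i, j)
        gap, i, j = best
        if gap > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical observations: three near the origin, two near (5, 5).
observations = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
D = dissimilarity_matrix(observations)
n_pairs = len(observations) * (len(observations) - 1) // 2

groups_tight = single_linkage(D, threshold=1.5)
groups_loose = single_linkage(D, threshold=10.0)
print(n_pairs, groups_tight, groups_loose)
```

With these assumed data, a cut-off of 1.5 yields two groups and a cut-off of
10.0 yields one, so the number of groups reported depends on a value chosen
by the analyst.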
We feel obligated to point out that clustering is a very subjective
process. First, a particular distance measure must be chosen. Second,
once the analysis is run, the results of the clustering procedure are
used to decide how many groups exist in the dataset. This requires that
we decide what is meant by 'more' similar.
These subjective choices