Classification and Clustering Applications - An Invariant Approach to Statistical Analysis of Shapes

6.5 Cluster analysis

Cluster analysis is concerned with investigating a set of data to discover whether or not relatively distinct groups of observations can be identified within the data set. The groups are not known a priori, and this distinguishes cluster analysis from classification, described above, in which individuals are allocated to one of a set of existing groups. Uncovering the group structure (if any) of a set of data is clearly of considerable importance in understanding the data and using them to answer substantive science-based questions.

As in the classification problem, the first step in cluster analysis is to define a suitable measure of distance. Since the classes are not known a priori, we calculate the distance between pairs of observations instead of calculating the distance between an observation and the parameters of the known classes. Suppose we have N observations. Calculating the observation-to-observation distances gives N(N - 1)/2 pairwise distances between the N individuals. These are dissimilarities between individuals and should not be confused with inter-landmark distances within an individual. The dissimilarities between all individuals can be collected in an N × N matrix, called the matrix of dissimilarities, where each row and each column corresponds to an individual. The dissimilarity measure calculated between individual a and individual p is entered into the cell where column a and row p intersect and again where column p and row a intersect. The matrix of dissimilarities is a square symmetric matrix that is mathematically similar to the Euclidean Distance Matrix (Form Matrix) that was used in Chapters 3 to 5. Given such a matrix of dissimilarities, we try to construct the groups so that the observations within a group are more similar to one another, as measured by their dissimilarities, than they are to observations in other groups. There are many standard statistical procedures (e.g., hierarchical clustering or k-means clustering) that may be used for this purpose. There are also several standard statistical packages that implement these procedures. We do not discuss the details of these procedures here. WinEDMA software (http://faith.med.jhmi.edu) offers a procedure under the heading “ordination.”
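To make the construction concrete, the following is a minimal sketch in Python of building the matrix of dissimilarities and grouping the individuals from it. It rests on several assumptions that are not part of the text: ordinary Euclidean distance between two vectors of measurements stands in for whatever dissimilarity measure is chosen, a naive average-linkage agglomerative procedure stands in for hierarchical clustering in general, and the function names, the data, and the parameter n_groups are hypothetical. It is not the procedure implemented in WinEDMA.

```python
from itertools import combinations
from math import dist  # Euclidean distance; an assumed stand-in dissimilarity


def dissimilarity_matrix(observations, dissimilarity=dist):
    """Return the symmetric N x N matrix of pairwise dissimilarities."""
    n = len(observations)
    matrix = [[0.0] * n for _ in range(n)]
    # Only N(N - 1)/2 pairs are computed; each value fills two cells,
    # (row a, column p) and (row p, column a).
    for a, p in combinations(range(n), 2):
        d = dissimilarity(observations[a], observations[p])
        matrix[a][p] = d
        matrix[p][a] = d
    return matrix


def average_linkage(matrix, n_groups):
    """Naive agglomerative clustering on a dissimilarity matrix.

    Starts with every individual in its own group and repeatedly merges
    the two groups with the smallest mean pairwise dissimilarity until
    n_groups remain. The analyst supplies n_groups.
    """
    groups = [[i] for i in range(len(matrix))]

    def between(g, h):
        # Mean dissimilarity over all pairs drawn from the two groups.
        return sum(matrix[i][j] for i in g for j in h) / (len(g) * len(h))

    while len(groups) > n_groups:
        g, h = min(
            combinations(range(len(groups)), 2),
            key=lambda pair: between(groups[pair[0]], groups[pair[1]]),
        )
        groups[g] = groups[g] + groups[h]  # g < h, so index g stays valid
        del groups[h]
    return [sorted(group) for group in groups]


# Hypothetical data: five individuals, each summarized by two measurements.
observations = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (5.0, 9.0)]
D = dissimilarity_matrix(observations)
three_groups = average_linkage(D, n_groups=3)
two_groups = average_linkage(D, n_groups=2)
```

With these made-up data, asking for three groups keeps the fifth individual on its own, while asking for two groups merges it with the third and fourth. The same matrix therefore supports more than one grouping, depending on the number of groups requested. Standard packages provide tested versions of such procedures; in SciPy, for example, scipy.cluster.hierarchy.linkage and fcluster cover the hierarchical case.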

We feel obligated to point out that clustering is a very subjective process. First, a particular distance measure must be chosen. Second, once the analysis is run, the results of the clustering procedure are used to decide how many groups exist in the data set. This requires that we decide what is meant by “more” similar. These subjective choices
