Visualizing Cluster Analysis and Finite Mixture Models - Data Visualization


Figure . shows a dendrogram for the dentition data created using an agglomerative algorithm with Manhattan distance and the average linkage method. Manhattan distance was chosen because it has a direct interpretation for these data: it equals the total difference, summed over the teeth groups, between the counts for two given species.
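As a small illustration of this interpretation, the sketch below computes the Manhattan distance between two count vectors. The two species vectors are made-up placeholders, not rows from the dentition table, and the function name `manhattan` is an assumption for this example.

```python
def manhattan(a, b):
    """Sum of absolute differences between two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical teeth-group counts (illustrative values, not from the data set).
species_a = (2, 3, 1, 1, 3, 3, 3, 3)
species_b = (2, 3, 1, 1, 2, 3, 3, 3)

identical = manhattan(species_a, species_a)  # identical configurations
one_tooth = manhattan(species_a, species_b)  # one count differs by one tooth
print(identical, one_tooth)
```

With these placeholder vectors the distance is a plain tooth count: zero for identical configurations and one when a single group differs by one tooth.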

First we look for animals separated by the minimal distance. Obviously the minimum possible difference is zero, i.e., animals with identical teeth configurations, like the two bats shown in the last two rows of Table . . Animals separated by zero distance are depicted by vertical lines immediately to the left of their names, as shown for mink, weasel, ferret, badger and skunk at the top of the graph. After all of the animals separated by zero distance have been found and connected, those groups that are the next smallest average distance apart are joined. In our case, the next smallest distance possible is a difference of one tooth, as seen for the pygmy bat and the house bat (also shown in Table . ).
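The merging order described above can be sketched in a few lines of plain Python. This is a naive illustration under stated assumptions, not the software used for the figure: the function name `average_linkage` and the four toy count vectors are hypothetical, and ties are broken by taking the first pair found.

```python
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute differences between two count vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_linkage(points):
    """Naive agglomerative clustering with average linkage.

    Returns the merges in order as (members_a, members_b, height),
    where members are tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for ca, cb in combinations(clusters, 2):
            # Average of all pairwise distances between the two groups.
            total = sum(manhattan(points[i], points[j]) for i in ca for j in cb)
            d = total / (len(ca) * len(cb))
            if best is None or d < best[0]:
                best = (d, ca, cb)
        d, ca, cb = best
        clusters.remove(ca)
        clusters.remove(cb)
        clusters.append(ca + cb)
        merges.append((ca, cb, d))
    return merges


# Hypothetical count vectors: two identical animals, one that differs
# by a single tooth, and one outlier (illustrative values only).
toy = [(2, 3, 1, 1), (2, 3, 1, 1), (2, 3, 1, 2), (0, 0, 0, 8)]
merges = average_linkage(toy)
heights = [h for _, _, h in merges]
print(merges)
```

In this toy run the identical pair is joined first at height zero, the animal one tooth away joins next at height one, and the outlier is attached last. In practice a library routine would normally be used instead; for instance, SciPy's `scipy.cluster.hierarchy.linkage` accepts `method="average"` and `metric="cityblock"`.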

Note that the actual layout of a dendrogram is not unique, because at each branching point the top and bottom branches could be exchanged. For example, the armadillo, which represents an outlier here because it has eight molars and no other types of teeth, could have equally well been placed at the top of the graph. N − 1 branching points are needed to connect all N data points, so the total number of dendrograms that could be drawn for exactly the same clustering is 2^(N − 1). This is much smaller than all possible permutations (N!), but is still quite a large number. Therefore, many software packages that perform hierarchical clustering allow the user to rearrange the observations, either manually or by specifying an ordering function.
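To make the counting argument concrete: a binary tree over N leaves has N − 1 branching points, and each can be flipped independently, which gives 2^(N − 1) leaf orders for the same clustering. The sketch below enumerates them for a small tree. The nested-tuple tree and its shape are hypothetical and serve only as an illustration.

```python
from math import factorial


def layouts(tree):
    """Enumerate every leaf order obtainable by flipping branching points.

    A tree is either a leaf label (a string) or a pair (left, right).
    """
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    result = []
    for l in layouts(left):
        for r in layouts(right):
            result.append(l + r)  # branches in the given order
            result.append(r + l)  # branches exchanged
    return result


# Hypothetical clustering of four animals (the tree shape is assumed).
tree = ((("mink", "weasel"), "skunk"), "armadillo")
all_layouts = layouts(tree)
n_leaves = 4

print(len(all_layouts))      # number of drawings of the same clustering
print(2 ** (n_leaves - 1))   # 2^(N - 1)
print(factorial(n_leaves))   # N!, all permutations of the leaves
```

For four leaves this yields 8 distinct layouts out of 24 possible permutations; the outlier ends up either first or last in every one of them, which matches the remark that it could equally well be drawn at the top.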


11.2.2 Heatmaps

Of course, we can cluster the variables as well as the observations. For example, we might be interested in whether the animals differ more in terms of tooth type or top/bottom jaw. At the top of Fig. . is a dendrogram for the variables sorted as they appear in the data set, i.e., with the top and bottom of each type next to each other. The original sorting of the data is compatible with hierarchical clustering of the variables (as depicted by the dendrogram), because there are no crossing lines in the tree. This leads to the (rather obvious) conclusion that the variables for the same type of teeth on the top and bottom jaws are very similar.
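Clustering the variables amounts to treating the columns of the data matrix as the objects to compare. The sketch below transposes a small matrix and finds each variable's nearest neighbour under Manhattan distance. The variable names and counts are hypothetical placeholders chosen for illustration, not values from the dentition data.

```python
def manhattan(a, b):
    """Sum of absolute differences between two count vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical variables and counts (one row per animal).
names = ["top_incisors", "bottom_incisors", "top_molars", "bottom_molars"]
rows = [
    (3, 3, 1, 2),
    (3, 3, 2, 2),
    (1, 1, 3, 3),
    (0, 0, 8, 8),
]

# Transpose: each variable becomes a vector of counts across animals.
columns = dict(zip(names, zip(*rows)))

# For every variable, find the closest other variable.
nearest = {}
for name in names:
    others = [n for n in names if n != name]
    nearest[name] = min(others, key=lambda n: manhattan(columns[name], columns[n]))

print(nearest)
```

With these toy counts each variable's nearest neighbour is the same tooth type on the other jaw, which is the pattern the variable dendrogram expresses.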

Figure . is a so-called cluster heatmap. The main part is an image plot of the original data, where each cell in the matrix corresponds to a value in the original data set. Columns and rows are permuted to conform with the hierarchical clustering of variables and observations; the corresponding dendrograms are placed on top and to the left of the matrix, respectively.
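The permutation step can be sketched without any plotting library. In the example below the row and column orders are hard-coded assumptions standing in for the leaf orders of the two dendrograms, the data values are hypothetical, and the "image" is rendered with text characters instead of colours.

```python
def permute(matrix, row_order, col_order):
    """Reorder rows and columns of a matrix given as a list of lists."""
    return [[matrix[r][c] for c in col_order] for r in row_order]


# Hypothetical counts; columns are deliberately stored in a mixed order:
# top_incisors, top_molars, bottom_incisors, bottom_molars.
data = [
    [3, 1, 3, 1],
    [1, 3, 1, 3],
    [3, 2, 3, 1],
]

# Assumed leaf orders, as they might come from clustering rows and columns.
row_order = [0, 2, 1]
col_order = [0, 2, 1, 3]

permuted = permute(data, row_order, col_order)

# Crude text heatmap: a denser character stands for a higher count (0 to 3).
shades = " .:#"
lines = ["".join(shades[v] for v in row) for row in permuted]
print("\n".join(lines))
```

After the permutation, similar rows and similar columns sit next to each other, so blocks of high and low counts become visible. For real plots, a ready-made routine such as seaborn's `clustermap`, which takes `method` and `metric` arguments, is one option for producing a similar display with dendrograms attached.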

Many important features of this data set can be easily picked out using the heatmap representation. The strongest patterns are the four “vertical stripes” for each of the four types of teeth, because many animals have the same (or very similar) counts on the top and bottom jaws. We can also see that the number of canines in general is rather low, while the other three tooth types show “blocks” of animals with either high or low counts. For example, the predators in the upper rows have larger incisor and premolar counts, while the rodents in the bottom rows have more molars.
