by aggregating two elements, using criteria linked to variance. The computations
are not time consuming when the clustering is performed after a factorial analysis
(PCA or MCA) and the objects to be classified are located by their coordinates
on the first axes of the analysis.
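This two-stage workflow can be sketched in Python, using scikit-learn's PCA and SciPy's Ward linkage as stand-ins for the factorial analysis and the variance-based aggregation criterion; the synthetic data and the choice of four components are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage

# Hypothetical data: 200 observations described by 10 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Stage 1: factorial analysis (here PCA); keep only the first
# principal coordinates of each object.
coords = PCA(n_components=4).fit_transform(X)

# Stage 2: hierarchical clustering with Ward's variance criterion,
# computed on the reduced coordinates rather than the raw variables.
Z = linkage(coords, method="ward")
print(Z.shape)  # (n - 1, 4): one row per aggregation step
```

Working on a handful of principal coordinates rather than the full data matrix is what keeps the distance computations cheap.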
Step : final partition
The partition of the population is defined by cutting the dendrogram. Choosing
the level of the cut, and thus the number of classes in the partition, can be done
by looking at the tree: the cut has to be made above the low aggregations, which
bring together the elements that are very close to one another, and under the high
aggregations, which lump together all the various groups in the population.
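Cutting the tree at a chosen number of classes can be done with SciPy's `fcluster`; this is a minimal sketch on artificial, well-separated data, so the appropriate cut level (two classes) is obvious rather than determined by visual inspection as the text recommends.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated artificial groups, so the gap between low and
# high aggregations in the dendrogram is easy to see.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

Z = linkage(X, method="ward")

# Cut the dendrogram so that at most t classes remain: the cut falls
# above the low aggregations and under the high ones.
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.unique(labels))  # [1 2]
```

In practice one would inspect the dendrogram (or the sequence of aggregation levels in `Z`) before fixing `t`.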
Some Considerations on the MIXED strategy
Classifying a large dataset is a complex task, and it is difficult to find an algorithm
that alone will lead to an optimal result. The proposed strategy, which is not entirely
automatic and which requires several control parameters, allows us to retain control
over the classification process. The procedure below illustrates an exploratory strategy
allowing the definition of satisfactory partition(s) of data. It is weakly affected by
the number of units and can offer good results in a fairly reasonable time. In MDA
applications on real datasets, especially in cases of huge databases, much experience is
required to effectively tune the procedure parameters (Confais and Nakache, ).
A good compromise between accuracy of results and computational time can be
achieved by using the following parameters:
1. The number of basic partitionings, which through cross-tabulation define the
stable groups (usually two or three basic partitionings);
2. The number of groups in each basic partitioning (approximately equal to the
unknown number of “real” groups, usually between and );
3. The number of iterations to accomplish each basic partitioning (less than five is
usually sufficient);
4. The number of principal coordinates used to compute any distance and aggregation
criterion (depending on the decrease of the eigenvalues of principal axis
analysis: usually between and for a large number of variables);
5. Finally, the cut level of the hierarchical tree in order to determine the number of
final groups (in general, by visual inspection).
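The full procedure can be outlined in a few lines of Python. This is a simplified sketch, not the authors' implementation: scikit-learn's KMeans plays the role of the basic partitioning algorithm, the stable groups are the non-empty cells of the cross-tabulation of two partitionings, and the hierarchical tree is built on the (unweighted) stable-group centroids; a faithful version would weight each centroid by its group size, and all parameter values here are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))  # stand-in for the first principal coordinates

# Parameters 1-3: two basic partitionings, 8 groups each,
# a small number of iterations per partitioning.
p1 = KMeans(n_clusters=8, n_init=1, max_iter=5, random_state=0).fit_predict(X)
p2 = KMeans(n_clusters=8, n_init=1, max_iter=5, random_state=1).fit_predict(X)

# Cross-tabulation: each stable group is a non-empty cell, i.e. a set of
# units classified together by both basic partitionings.
cells = {}
for i, key in enumerate(zip(p1, p2)):
    cells.setdefault(key, []).append(i)

# Hierarchical tree (Ward criterion) built on the stable-group centroids.
centroids = np.array([X[idx].mean(axis=0) for idx in cells.values()])
Z = linkage(centroids, method="ward")

# Parameter 5: cut the tree to obtain the final number of groups.
final = fcluster(Z, t=5, criterion="maxclust")
print(len(cells), "stable groups reduced to", final.max(), "final groups")
```

Because the tree is built on at most 8 × 8 = 64 centroids instead of 500 units, the hierarchical step stays cheap regardless of the population size.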
Nearest-neighbor-accelerated algorithms for hierarchical classification permit one to
directly build a tree on the entire population. However, these algorithms cannot read
the data matrix sequentially. The data, which usually are the first principal coordinates
of a preliminary analysis, must be stored in central memory. This is not a problem
when the tree is built on the stable groups of a preliminary k-means partition
(also computed on the first principal axes). Besides working with direct reading, the
partitioning algorithm has another advantage. The criterion of homogeneity of the
groups is better satisfied in finding an optimal partition rather than in the more
constrained case of finding an optimal family of nested partitions (hierarchical tree). In
addition, building stable groups constitutes a sort of self-validation of the classification
procedure.