Huge Multidimensional Data Visualization: Back to the Virtue of Principal Coordinates and Dendrograms in the New Computer Age


tions; and, finally, algorithms leading to partitions, such as the methods of clustering about moving centers or other minimum variance algorithms.

4.6 Computational Issues

Computational aspects in statistical data mining are treated comprehensively by Wegman and Solka ( ); we refer to their description of computational complexity to better understand the impact of large and massive datasets on MDA approaches. The most critical issues become apparent when applying cluster analysis methods. As a matter of fact, the computational effort needed to obtain results in FA depends on the number of variables. In the common situation in which the number of statistical units is much larger than the number of variables, the solution can be computed on a matrix of order p × p, where p indicates the number of columns. Looking at ( . ), it is straightforward to notice that the problem in R^p has a very feasible computational complexity, of the order O(p ). With very low computational effort, transition formulae (Lebart et al., ) permit one to compute the results in R^n.
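The passage from R^p to R^n can be illustrated with a small sketch. It assumes a plain column-centered data table with unit weights and uses the standard relation v = Xu/√λ between an eigenvector u of X'X and the corresponding eigenvector v of XX'; the exact form of the transition formulae in Lebart et al. may differ in weighting. All function names are hypothetical, and power iteration is used only to keep the example free of external libraries.

```python
import math
import random

def matvec(A, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def cross_product_matrix(X):
    """Return the p x p matrix X'X of an n x p data table."""
    p = len(X[0])
    return [[sum(row[j] * row[k] for row in X) for k in range(p)]
            for j in range(p)]

def leading_eigenpair(A, iters=200):
    """Dominant eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    v = [1.0 / math.sqrt(len(A))] * len(A)
    for _ in range(iters):
        w = matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, matvec(A, v)))
    return lam, v

# Synthetic table (assumption): n units, p variables with different spreads.
random.seed(0)
n, p = 500, 3
X = [[random.gauss(0.0, sd) for sd in (3.0, 1.0, 0.5)] for _ in range(n)]
means = [sum(row[j] for row in X) / n for j in range(p)]
X = [[row[j] - means[j] for j in range(p)] for row in X]  # centre the columns

# Solve the small problem in R^p: only a p x p matrix is decomposed.
lam, u = leading_eigenpair(cross_product_matrix(X))

# Transition to R^n: coordinates of the units, rescaled to unit norm.
scores = matvec(X, u)
v = [s / math.sqrt(lam) for s in scores]

# Check that v is an eigenvector of XX' without ever forming the n x n matrix.
Xt_v = [sum(X[i][j] * v[i] for i in range(n)) for j in range(p)]
residual = max(abs(a - lam * b) for a, b in zip(matvec(X, Xt_v), v))
print(f"first eigenvalue: {lam:.2f}, residual of XX'v = lam*v: {residual:.2e}")
```

The n × n matrix XX' is never built: the only decomposition is carried out on the p × p matrix, and the results for the n units follow from a single matrix-vector product.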

Hierarchical clustering algorithms, conversely, are very time consuming, as the computational effort for such algorithms is of the order O(m ), where m denotes the number of entries in the data matrix. According to Wegman and Solka ( ), assuming a Pentium IV -GHz machine with -gigaflop performance, the time required for clustering a dataset with a medium number of entries ( bytes) is about min, while about d are required to handle a large number of entries ( bytes). When the dataset size rises to bytes (huge) it takes years!
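Estimates of this kind can be reproduced with a back-of-envelope calculation. The sketch below is illustrative only: it assumes a cost of exactly m² operations, one operation per floating-point instruction, and a hypothetical machine sustaining 10⁹ operations per second. The dataset sizes are placeholders chosen for the example and are not taken from Wegman and Solka.

```python
def clustering_time_seconds(m, flops, exponent=2):
    """Seconds needed for m**exponent operations at `flops` operations per second."""
    return m ** exponent / flops

def humanize(seconds):
    """Express a duration in the largest convenient unit."""
    for unit, size in (("years", 365 * 24 * 3600), ("days", 24 * 3600),
                       ("minutes", 60), ("seconds", 1)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.3g} seconds"

ASSUMED_FLOPS = 10 ** 9  # hypothetical sustained speed (assumption)

for m in (10 ** 6, 10 ** 8, 10 ** 10):  # illustrative sizes only (assumption)
    estimate = humanize(clustering_time_seconds(m, ASSUMED_FLOPS))
    print(f"m = {m:.0e}: {estimate}")
```

Because the cost grows with the square of m, multiplying the number of entries by 100 multiplies the running time by 10,000, which is why the jump from one size class to the next is so dramatic.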

Nonhierarchical clustering algorithms offer good performance with decent computation time even with huge datasets. In the following subsections, we briefly introduce partitioning methods and describe a mixed two-step strategy (nonhierarchical + hierarchical) widely used in an MDA framework.

In Sect. . we will show how advanced graphical representations can add useful information to a factorial plane and how human-machine interaction helps us to navigate through the data in search of interesting patterns.


4.6.1 Partitioning Methods

Nonhierarchical clustering attempts to directly decompose a dataset into a set of disjoint clusters of similar data items. The partition is obtained through the minimization of a chosen measure of dissimilarity. In particular, taking into account the variance decomposition, the method aims to minimize the ratio

$$Q = \frac{\operatorname{trace}(W)}{\operatorname{trace}(T)} \qquad ( . )$$

where trace(W) and trace(T) denote the traces of the within-group and total variance-covariance matrices, respectively. According to the variance decomposition formula,
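As a minimal sketch of the criterion, the ratio Q can be computed directly for any given partition. The code assumes equal weights 1/n for the units and Euclidean distances, so that each trace is the average squared distance to the relevant centroid; the function names are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(pt[j] for pt in points) / n for j in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def trace_total(points):
    """trace(T): average squared distance to the overall centroid."""
    g = centroid(points)
    return sum(sq_dist(pt, g) for pt in points) / len(points)

def trace_within(points, labels):
    """trace(W): average squared distance to each point's own cluster centroid."""
    groups = {}
    for pt, k in zip(points, labels):
        groups.setdefault(k, []).append(pt)
    total = 0.0
    for members in groups.values():
        c = centroid(members)
        total += sum(sq_dist(pt, c) for pt in members)
    return total / len(points)

def q_ratio(points, labels):
    """Q = trace(W) / trace(T); smaller values mean more compact clusters."""
    return trace_within(points, labels) / trace_total(points)

# Two tight pairs of points, far apart along the first coordinate.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0, 0, 1, 1]  # groups the nearby points together
bad = [0, 1, 0, 1]   # mixes the two pairs

print(q_ratio(points, good), q_ratio(points, bad))
```

The partition that groups nearby points yields a much smaller Q than the one that mixes them, which is the behavior a partitioning method exploits when it searches for the minimum of the ratio.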
