tions; and, finally, algorithms leading to partitions, such as the methods of clustering
about moving centers or other minimum variance algorithms.
4.6 Computational Issues
Computational aspects in statistical data mining are treated comprehensively by Wegman and Solka ( ); we refer to their description of computational complexity to better understand the impact of large and massive datasets on the MDA approaches. The most critical issues become apparent when applying cluster analysis methods. As a matter of fact, the computational effort needed to attain results in FA depends on the number of variables. In the common situation in which the number of statistical units is much larger than the number of variables, the solution can be computed on the matrix of order p × p, where p indicates the number of columns. Looking at ( . ), it is straightforward to notice that the problem in R^p has a very feasible computational complexity, of the order O(p ). With very low computational effort, transition formulae (Lebart et al., ) permit one to compute the results in R^n.
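The passage from R^p to R^n can be illustrated with a minimal sketch. It assumes column-centered data and uses the standard singular value relation u = Xv / sqrt(lambda) as a stand-in for the transition formulae; the exact form and notation in Lebart et al. may differ. Power iteration, the helper names and the simulated data are illustrative assumptions, not the authors' procedure.

```python
import math
import random


def transpose(A):
    return [list(col) for col in zip(*A)]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def leading_eigenpair(S, iters=500):
    """Power iteration on a symmetric positive semi-definite matrix S."""
    v = [1.0] * len(S)
    for _ in range(iters):
        w = matvec(S, v)
        nw = norm(w)
        v = [c / nw for c in w]
    lam = sum(a * b for a, b in zip(v, matvec(S, v)))  # Rayleigh quotient
    return lam, v


# Simulated data (assumption): n = 200 units, p = 3 variables, n >> p.
random.seed(0)
n, p = 200, 3
X = [[random.gauss(0, 1) * s for s in (3.0, 1.0, 0.5)] for _ in range(n)]
means = [sum(col) / n for col in zip(*X)]
X = [[x - m for x, m in zip(row, means)] for row in X]  # center the columns

# The eigenproblem is solved on the small p x p matrix X'X only.
S = matmul(transpose(X), X)
lam, v = leading_eigenpair(S)

# Transition step: the matching unit-length axis in R^n, obtained without
# ever forming the n x n matrix XX'.
u = [c / math.sqrt(lam) for c in matvec(X, v)]

# Check that u is an eigenvector of XX' with the same eigenvalue,
# computing XX'u as X(X'u) to avoid the n x n product.
XXt_u = matvec(X, matvec(transpose(X), u))
residual = max(abs(a - lam * b) for a, b in zip(XXt_u, u))
```

The only eigenproblem solved here has size p × p; the n-dimensional result costs one matrix-vector product, which is why the number of variables, not the number of units, drives the effort.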
Hierarchical clustering algorithms, conversely, are very time consuming, as the computational effort for such algorithms is of the order O(m ), where m denotes the number of entries in the data matrix. According to Wegman and Solka ( ), using a Pentium IV -GHz machine with -gigaflop performance assumed, the time required for clustering a dataset with a medium number of entries ( bytes) is about min, while about d are required to handle a large number of entries ( bytes). When the dataset size rises to bytes (huge), it takes years!
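A back-of-the-envelope sketch shows how quickly running time grows with a polynomial order of complexity. The exponent, the machine speed and the dataset sizes below are hypothetical values chosen for illustration only; they are not the figures reported by Wegman and Solka.

```python
def estimated_seconds(m, exponent, flops):
    """Rough running time for an algorithm needing m**exponent operations."""
    return m ** exponent / flops


ASSUMED_EXPONENT = 2   # hypothetical: quadratic growth in the number of entries
ASSUMED_FLOPS = 1e9    # hypothetical: one billion operations per second

for m in (10**4, 10**5, 10**6):
    t = estimated_seconds(m, ASSUMED_EXPONENT, ASSUMED_FLOPS)
    print(f"m = {m:>9,d} entries: about {t:,.1f} s")
```

Under these assumptions, every tenfold increase in the number of entries multiplies the running time by one hundred, which is the mechanism behind the jump from minutes to years described above.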
Nonhierarchical clustering algorithms offer good performance with decent computation time even with huge datasets. In the following subsections, we briefly introduce partitioning methods and describe a mixed two-step strategy (nonhierarchical + hierarchical) widely used in an MDA framework.
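As an illustration of a nonhierarchical algorithm, the following is a minimal sketch of clustering about moving centers (a k-means-style procedure). It also reports the share of the total sum of squares that lies within clusters, which is the kind of minimum-variance criterion such methods try to reduce. The function names, the random initialization, the stopping rule and the simulated data are assumptions made for this example, not the authors' implementation.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def moving_centers(data, k, iters=100, seed=0):
    """Alternate between assigning units to the nearest center and moving centers."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # assumption: k random units as starting centers
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in data
        ]
        if new_labels == labels:  # partition is stable: stop
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = centroid(members)
    return labels, centers


def within_total_ratio(data, labels, centers):
    """Within-cluster sum of squares divided by the total sum of squares."""
    g = centroid(data)
    within = sum(sq_dist(x, centers[lab]) for x, lab in zip(data, labels))
    total = sum(sq_dist(x, g) for x in data)
    return within / total


# Simulated data (assumption): two groups of 50 units in two dimensions.
rng = random.Random(1)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
data += [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(50)]

labels, centers = moving_centers(data, k=2)
q = within_total_ratio(data, labels, centers)
print(f"within/total ratio: {q:.3f}")
```

Each pass costs on the order of (number of units) × k distance evaluations, so no matrix of pairwise distances between units is needed; this is what keeps the approach practical on large datasets.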
In Sect. . we will show how advanced graphical representations can add useful information to a factorial plane and how human-machine interaction helps us to navigate through the data in search of interesting patterns.
4.6.1 Partitioning Methods
Nonhierarchical clustering attempts to directly decompose a dataset into a set of disjoint clusters of similar data items. The partition is obtained through the minimization of a chosen measure of dissimilarity. In particular, taking into account the variance decomposition, the method aims to minimize the ratio

$$Q = \frac{\operatorname{trace}(W)}{\operatorname{trace}(T)} \qquad ( . )$$

where W and T denote the within-groups and total variance-covariance matrices, respectively. According to the variance decomposition formula,