minimizing the quantity in ( . ) is equivalent to maximizing the quantity trace(B)/trace(T). Notice that trace(·) indicates the trace operator and that trace(T) = trace(W) + trace(B) (Mardia et al.).
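As a quick numerical check of this decomposition, the following Python sketch (with made-up data and an arbitrary partition; all variable names are purely illustrative) computes the total, within-group, and between-group scatter matrices and verifies that trace(T) = trace(W) + trace(B), so that minimizing trace(W) and maximizing trace(B)/trace(T) are indeed the same problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))             # 30 units described by 4 variables (illustrative)
labels = rng.integers(0, 3, size=30)     # an arbitrary partition into 3 groups

grand_mean = X.mean(axis=0)
T = (X - grand_mean).T @ (X - grand_mean)    # total scatter matrix

p = X.shape[1]
W = np.zeros((p, p))                         # within-group scatter
B = np.zeros((p, p))                         # between-group scatter
for g in np.unique(labels):
    Xg = X[labels == g]
    mg = Xg.mean(axis=0)
    W += (Xg - mg).T @ (Xg - mg)
    d = (mg - grand_mean).reshape(-1, 1)
    B += len(Xg) * (d @ d.T)

# trace(T) = trace(W) + trace(B): minimizing trace(W) is therefore
# equivalent to maximizing trace(B) / trace(T).
print(np.trace(T), np.trace(W) + np.trace(B))
```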
The algorithm family known as k-means is the most widely known nonhierarchical clustering approach; among the different k-means algorithms, that proposed by Forgy is the most widely implemented in specialized software packages (Hartigan).
Let us assume the partitioning of a set of n units characterized by p variables; the main steps of a k-means algorithm are the following:
Step 1: k provisional group centers are randomly determined: c_1, c_2, ..., c_k.
Step 2: a partition P^0 = (C_1, C_2, ..., C_k) of the n objects into k clusters is obtained using the assignment rule: a unit belongs to C_k if it is nearer to c_k than to all other centers; an object x_i is then assigned to the cluster C_k if d(x_i, c_k) = min_j d(x_i, c_j).
Step 3: k new cluster centers are determined, c_1, c_2, ..., c_k, as the centers of gravity of the clusters of the partition P^0. The new centers are then used to define a new partition P^1, constructed according to the same rules used for P^0.
The previous steps are iterated until convergence to a final partition, i.e., when two succeeding iterations lead to the same partition or when a chosen criterion describing the quality of the obtained partition is satisfied. A further stopping criterion can be based on the number of iterations.
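A minimal sketch of these steps in Python follows, assuming plain NumPy; the function name forgy_kmeans, its arguments, and the returned history of the within-group criterion are illustrative choices, not a routine from any particular package. The k provisional centers are drawn at random among the units, each unit is assigned to the nearest center, the centers are recomputed as centers of gravity, and iteration stops when two successive partitions coincide or a maximum number of iterations is reached.

```python
import numpy as np

def forgy_kmeans(X, k, max_iter=100, seed=None):
    """Illustrative k-means with Forgy-style initialization (not a library routine)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: k provisional group centers drawn at random among the n units.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels, history = None, []
    for _ in range(max_iter):
        # Step 2: assign each unit to the cluster whose center is nearest.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Within-group criterion: sum of squared distances to the nearest centers.
        history.append(float((dists[np.arange(len(X)), new_labels] ** 2).sum()))
        # Stopping rule: two succeeding iterations lead to the same partition.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each center as the center of gravity of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers, history
```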
Stable Groups Identification
Although it is based on a relatively slight theoretical basis, the k-means classification method has an efficacy largely attested to by empirical results (Milligan). As a consequence, it is the partitioning method best adapted to large datasets. The k-means method is used as an adjunct to other methods, such as clustering prior to principal axis analysis, or directly as a descriptive tool. The method is particularly well adapted to large numerical datasets because the data are read directly: the data matrix is saved on auxiliary memory and is read several times in sequential fashion, without requiring large amounts of computer memory.
As the within-group variance can only decrease between two successive iterations, the algorithm does converge; however, it is not the convergence itself but its very high rate of convergence that justifies using this method in practice.
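With the illustrative forgy_kmeans sketch above, which records the within-group criterion (the sum of squared distances of the units to their nearest centers) after each assignment step, this monotone decrease can be checked directly on any numeric data matrix X, for instance the one generated in the first sketch:

```python
labels, centers, history = forgy_kmeans(X, k=3, seed=0)
# Successive values of the criterion never increase, which is what
# guarantees convergence of the iterations.
assert all(a >= b for a, b in zip(history, history[1:]))
print(history)
```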
Generally, the obtained partitions depend on the centers chosen at the first iteration, and the algorithm could converge toward local optima. The procedure of finding stable groups (Lebart et al.) is a kind of remedy for this situation. Its main advantage is that it elaborates on the results obtained in the rigid framework of a single partition by highlighting high-density regions of the object points. The technique consists in performing several partitions starting from different sets of centers and keeping as stable groups the sets of objects that are always assigned to the same cluster.
Let us consider, as a small illustrative example, partitioning individuals into several homogeneous groups. We perform a first basic partitioning into groups around moving centers (only a few iterations are necessary to ensure stability of the groups). This procedure is repeated three times.
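A sketch of this crossing of partitions, reusing the illustrative forgy_kmeans function above (the name stable_groups and its parameters are assumptions for the example, not part of any published implementation): several partitions are computed from different random sets of initial centers, and units are kept together only if they fall in the same cluster in every run.

```python
def stable_groups(X, k, n_runs=3, seed=0):
    """Cross several k-means partitions started from different centers;
    units that are always clustered together form a stable group."""
    # Assumes forgy_kmeans from the sketch above is available.
    runs = [forgy_kmeans(X, k, seed=seed + r)[0] for r in range(n_runs)]
    # Each unit is identified by the tuple of cluster labels it received
    # across the successive partitions (the crossed, or "product", partition).
    groups = {}
    for i, key in enumerate(zip(*runs)):
        groups.setdefault(key, []).append(i)
    return list(groups.values())    # lists of unit indices, one per stable group
```

In practice only the larger cells of this crossed partition are read as stable, high-density groups; the remaining small cells gather the units whose assignment depends on the choice of the initial centers.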