Exploring Validity Indices for Clustering Textual Data - Mining Complex Data

Information Technology Reference

In-Depth Information

Fig. 16.2. An illustration of a bi-dimensional dataset

Note that calculating VI for all the possible mergings between k clusters will

lead to a high complexity of O ( k 2 )at each level of the process. We overcome

this by considering, at a given level, only the M closest pairs of clusters 5 ,since

they form the most potential candidates to improve VI .

Context Space Composition. For each new cluster candidate C p ,aContext

Risk CR expresses how risky can be assessing C p for the overall clustering quality

in the expected upcoming mergings given the context of C p . Consider the two

new clusters candidates C p and C q depicted by their centroids respectively in

Figures 16.3 and 16.4 with five context clusters each ( C 1 ...C 5 ). We assume that

C p ,withits K -NN ( K = 5) neither too close nor too distant from its centroid,

is more risky than C q ,withits K -NN either too close or too distant from its

centroid. Therefore, we define a context space as the space including the K -NN

of a new cluster candidate C p . Then, as shown in Figure 16.3, we decompose the

context space of C p into three layers that we define below.

Intra layer. Clusters within this layer reduce CR as they should not lead to a

quick drop in VI .Forthis,theyhavetobe close enough to C p , therefore, likely to

be merged with C p in next iterations without causing a significant degradation

(comparing to the previous mergings) in the global intra-cluster compactness. As

a matter of fact, the clusters are getting larger over mergings, and thus the intra-

cluster is continuously deteriorating. At a level k where FD did not occurred

yet, we suppose that all the previous mergings that caused degradations in the

intra-cluster are acceptable. This layer is delimited by the thresholds t 0=0

and t 1= δ ( C p ). We define δ ( C p ) as the radius of the new cluster candidate C p

augmented by the standard deviation of radius values obtained following the

previous mergings. A radius is the maximum distance between the centroid of

C p and an element within C p .

δ ( C p )= radius ( C p )+ Std ( radius ( C n ...C k− 1 ))

5 We set M =10inourexperiments.

Search WWH ::

Custom Search

Home