significant improvement in the inter-cluster separation, since they are not sufficiently distant from C_p. This layer is delimited by the thresholds t_1 and t_2 previously defined.
Context Risk Calculation. In order to calculate CR for a candidate cluster C_p, we use the following formula:

CR(C_p) = (1/K) ( Σ_{i=1}^{n_1} R(C_i, C_p) - Σ_{j=1}^{n_2} I_1(C_j, C_p) - Σ_{h=1}^{n_3} I_2(C_h, C_p) )
where R(C_i, C_p), I_1(C_j, C_p) and I_2(C_h, C_p) denote the scores assigned to the contextual clusters situated respectively in the risk, intra and inter layers of C_p. All the scores are distributed over the [0, 1] range according to the distances of these clusters from the centroid of C_p (see Figures 16.3 and 16.4). n_1, n_2 and n_3 denote the numbers of clusters situated respectively in the risk, intra and inter layers. Consequently, CR varies between -1 (for a minimal risk) and 1 (for a maximal risk). For a contextual cluster C_x at distance d_x from S_p, its score is calculated according to the following conditions:
if d_x < t_1 then I_1(C_x, C_p) = (t_1 - d_x) / t_1
else if t_2 < d_x ≤ t_3 then I_2(C_x, C_p) = (d_x - t_2) / t_1
else if d_x > t_3 then I_2(C_x, C_p) = 1
else if t_1 ≤ d_x < (t_1 + t_2)/2 then R(C_x, C_p) = (d_x - t_1) / (t_1 / 2)
else if (t_1 + t_2)/2 ≤ d_x ≤ t_2 then R(C_x, C_p) = (t_2 - d_x) / (t_1 / 2)
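To make the layered scoring and the CR aggregation concrete, the following Python sketch implements the conditions above. It is only an illustration under our reading of the formulas, not the authors' implementation: the function names layer_score and context_risk are ours, and the sign convention (risk-layer scores added, intra- and inter-layer scores subtracted) is inferred from the stated range of CR between -1 and 1.

def layer_score(d_x, t1, t2, t3):
    """Return (layer, score) for a contextual cluster at distance d_x
    from the centroid of the candidate cluster C_p, following the
    piecewise conditions above (thresholds assumed to satisfy t1 < t2 < t3)."""
    if d_x < t1:                          # intra layer
        return "intra", (t1 - d_x) / t1
    elif d_x > t3:                        # far part of the inter layer
        return "inter", 1.0
    elif d_x > t2:                        # inter layer, between t2 and t3
        return "inter", (d_x - t2) / t1
    elif d_x < (t1 + t2) / 2:             # risk layer, score grows from t1
        return "risk", (d_x - t1) / (t1 / 2)
    else:                                 # risk layer, score decreases towards t2
        return "risk", (t2 - d_x) / (t1 / 2)

def context_risk(distances, t1, t2, t3, K):
    """CR(C_p) over the K nearest contextual clusters, given their
    distances to the centroid of C_p. Risk-layer scores increase CR,
    intra- and inter-layer scores decrease it, so CR stays in [-1, 1]."""
    cr = 0.0
    for d_x in sorted(distances)[:K]:
        layer, score = layer_score(d_x, t1, t2, t3)
        cr += score if layer == "risk" else -score
    return cr / K

For example, with the hypothetical thresholds t_1 = 1, t_2 = 2, t_3 = 3, the call context_risk([0.2, 1.4, 3.5], 1, 2, 3, K=3) combines one intra-layer neighbor (score 0.8), one risk-layer neighbor (score 0.8) and one inter-layer neighbor (score 1.0), giving CR = (0.8 - 0.8 - 1.0)/3 ≈ -0.33, i.e. a low-risk candidate.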
In terms of complexity, CR cannot be considered computationally expensive. In fact, given K nearest neighbors and p clusters at a specific iteration, we compute CR for the M candidate clusters resulting from the merging of the M closest pairs of clusters. Therefore, at a given iteration, the complexity added to a clustering algorithm is O(M(p - 1 + K + K(K - 1)/2)). Furthermore, we argue that the parameters M and K have a "second order" effect on the results. In other words, they are not 'critical' parameters, and their choice depends solely on the extent to which we are willing to increase the complexity of the algorithm.
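For illustration, with hypothetical values M = 3, K = 10 and p = 50, the added cost is 3 × (49 + 10 + 45) = 312 score evaluations per iteration, a modest overhead.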
16.4 Experimental Study on Relative Validity Indices

16.4.1 Motivation
Because of their high computational cost, relative indices have most often been studied on low-dimensional datasets containing small numbers of clusters [12, 24, 9, 7, 21, 4, 29]. The principal motivation was the ability to visually inspect data in two or three dimensions, which makes it easy to predefine the optimal number of clusters k. Consequently, indices can be compared according to their ability to detect the optimal k. Even though such experiments can provide a preliminary idea of the performance of each validity index, they do not reflect their performance in real-world applications, where data is often high-dimensional and includes a large number of clusters.