significant improvement in the inter-cluster separation, since they are not sufficiently distant from C_p. This layer is delimited by the thresholds t_1 and t_2 previously defined.
Context Risk Calculation. In order to calculate CR for a candidate cluster C_p, we use the following formula:

CR(C_p) = (1/K) ( Σ_{i=1}^{n_1} R(C_i, C_p) - Σ_{j=1}^{n_2} I_1(C_j, C_p) - Σ_{h=1}^{n_3} I_2(C_h, C_p) )
where R(C_i, C_p), I_1(C_j, C_p) and I_2(C_h, C_p) denote the scores assigned to the contextual clusters situated respectively in the risk, intra and inter layers of C_p. All the scores are distributed over the [0, 1] range according to the distances of these clusters from the centroid of C_p (see Figures 16.3 and 16.4). n_1, n_2 and n_3 denote the numbers of clusters situated respectively in the risk, intra and inter layers. Consequently, CR varies between -1 (for a minimal risk) and 1 (for a maximal risk). For a contextual cluster C_x at distance d_x from S_p, its score is calculated according to the following conditions:
if d_x < t_1 then I_1(C_x, C_p) = (t_1 - d_x) / t_1
else if t_2 < d_x ≤ t_3 then I_2(C_x, C_p) = (d_x - t_2) / t_1
else if d_x > t_3 then I_2(C_x, C_p) = 1
else if t_1 ≤ d_x < (t_1 + t_2)/2 then R(C_x, C_p) = (d_x - t_1) / (t_1 / 2)
else if (t_1 + t_2)/2 ≤ d_x ≤ t_2 then R(C_x, C_p) = (t_2 - d_x) / (t_1 / 2)
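To make the layered scoring and the CR aggregation concrete, the following Python sketch implements the conditions above. It is only an illustration under our reading of the formulas, not the authors' implementation: the function names layer_score and context_risk are ours, and the sign convention (risk-layer scores added, intra- and inter-layer scores subtracted) is inferred from the stated range of CR between -1 and 1.

def layer_score(d_x, t1, t2, t3):
    """Return (layer, score) for a contextual cluster at distance d_x
    from the centroid of the candidate cluster C_p, following the
    piecewise conditions above (thresholds assumed to satisfy t1 < t2 < t3)."""
    if d_x < t1:                          # intra layer
        return "intra", (t1 - d_x) / t1
    elif d_x > t3:                        # far part of the inter layer
        return "inter", 1.0
    elif d_x > t2:                        # inter layer, between t2 and t3
        return "inter", (d_x - t2) / t1
    elif d_x < (t1 + t2) / 2:             # risk layer, score grows from t1
        return "risk", (d_x - t1) / (t1 / 2)
    else:                                 # risk layer, score decreases towards t2
        return "risk", (t2 - d_x) / (t1 / 2)

def context_risk(distances, t1, t2, t3, K):
    """CR(C_p) over the K nearest contextual clusters, given their
    distances to the centroid of C_p. Risk-layer scores increase CR,
    intra- and inter-layer scores decrease it, so CR stays in [-1, 1]."""
    cr = 0.0
    for d_x in sorted(distances)[:K]:
        layer, score = layer_score(d_x, t1, t2, t3)
        cr += score if layer == "risk" else -score
    return cr / K

For example, with the hypothetical thresholds t_1 = 1, t_2 = 2, t_3 = 3, the call context_risk([0.2, 1.4, 3.5], 1, 2, 3, K=3) combines one intra-layer neighbor (score 0.8), one risk-layer neighbor (score 0.8) and one inter-layer neighbor (score 1.0), giving CR = (0.8 - 0.8 - 1.0)/3 ≈ -0.33, i.e. a low-risk candidate.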
In terms of complexity, CR cannot be considered computationally expensive. In fact, given K nearest neighbors and p clusters at a specific iteration, we compute CR for the M candidate clusters resulting from the merging of the M closest pairs of clusters. Therefore, at a given iteration, the complexity added to a clustering algorithm is O(M(p - 1 + K + K(K - 1)/2)). Furthermore, we argue that the parameters M and K have a "second order" effect on the results. In other words, they are not 'critical' parameters, and their choice depends solely on the extent to which we are willing to increase the complexity of the algorithm.
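For illustration, with hypothetical values M = 3, K = 10 and p = 50, the added cost is 3 × (49 + 10 + 45) = 312 score evaluations per iteration, a modest overhead.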
16.4 Experimental Study on Relative Validity Indices

16.4.1 Motivation
Because of their high computational cost, relative indices have most often been studied on low-dimensional datasets containing small numbers of clusters [12, 24, 9, 7, 21, 4, 29]. The principal motivation was the ability to visually inspect data in two or three dimensions, which makes it easy to predefine the optimal number of clusters k. Consequently, indices can be compared according to their ability to detect the optimal k. Even though such experiments can provide a preliminary idea of the performance of each validity index, they do not reflect their performance in real-world applications, where data is often high-dimensional and includes a large number of clusters.