Information Technology Reference
In-Depth Information
Due to their facility of interpretation, we choose to focus on indices not scaling
with k . The other reason is related to our primary goal which is to enable the
usage of relative indices as stopping criteria. This is definitely a hard task that
will get much harder if the optimal k must be selected using relatively sophis-
ticated techniques like the “knee” detection, the gap statistics, or the stability
approach.
A known drawback when involving validity indices in clustering is the com-
putational cost that quickly becomes prohibitive when scaling to large and high-
dimensional datasets. The main reason is that pairwise similarities between the
dataset elements/clusters have to be calculated. As a preliminary attempt to
reduce complexity, we propose a new validity index, H 3, that we define as
follows 1 .
H 3= i =1 n i . n i
j =1 sim ( e j ,S i )
( i =1 sim ( S i ,S )) /k
where sim denotes the similarity between two objects, S i denotes the centroid
of cluster C i containing n i elements, e j denotes a data element, and S denotes
the collection's centroid which is the average vector of all clusters' centroids.
H 3 is significantly less expensive than other indices. The reason is that H 3
deals with centroids to calculate the inter-cluster separation and the intra-cluster
compactness. It has a linear complexity of O ( k + n ). H 3 is inspired from the H 1
and H 2 indices proposed by Zhao [38]. The difference is that H 3 does not follow
the trend of k after having removed its sensitivity to k in an ad-hoc manner.
As a matter of fact, the intra-cluster similarity decreases as k decreases, thus
the quality of clustering continuously deteriorates from an intra-cluster point
of view. We consider that an optimal partition is reached, when the average of
inter-cluster similarities, that tends to improve while grouping similar objects,
is no more able to overwhelm the intra-cluster deterioration.
16.3
A Method for Enhancing Relative Indices Usage as
Stopping Criteria
In this section, we propose a method that aims to reduce the complexity of
clustering algorithms when involving relative indices as criterion functions. One
potential option is to use these indices to terminate the clustering process at a
point where the “optimal” solution is reached, which allows to discard all the
remaining unnecessary part of the process.
16.3.1
Problem Definition
The classical usage of relative validity indices for determining the k yielding
the optimal clustering solution comes a-posteriori , after evaluating the different
1 Evaluation of the H 3 index is out of scope in this chapter.
 
Search WWH ::




Custom Search