parameters. There are several cluster validation methods based on the stability concept [19]. Ben-Hur et al. [3] have proposed a technique that exploits stability measurements of the clustering solutions obtained by perturbing a data set. In their approach, stability is characterized by the distribution of pairwise similarities between clusterings obtained from subsamples of the data. First, a co-association matrix is acquired using the resampling method; then, the Jaccard coefficient is extracted from this matrix as the stability measure.
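As a rough illustration of this idea, the Python sketch below clusters pairs of random subsamples with k-means and scores the agreement of the two solutions on the shared points with a pairwise Jaccard coefficient; the base clusterer, the subsample fraction, and all function names are illustrative choices, not details of [3].

import numpy as np
from sklearn.cluster import KMeans

def pairwise_jaccard(a, b):
    """Jaccard coefficient between two labelings of the same points:
    pairs co-clustered in both / pairs co-clustered in at least one."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)            # distinct pairs only
    both = np.sum(same_a[iu] & same_b[iu])
    either = np.sum(same_a[iu] | same_b[iu])
    return both / either if either else 1.0

def subsampling_stability(X, k, n_pairs=20, frac=0.8, seed=0):
    """Cluster pairs of random subsamples and record how well the two
    solutions agree on the points that the subsamples share."""
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for _ in range(n_pairs):
        i1 = rng.choice(n, int(frac * n), replace=False)
        i2 = rng.choice(n, int(frac * n), replace=False)
        l1 = KMeans(n_clusters=k, n_init=10).fit_predict(X[i1])
        l2 = KMeans(n_clusters=k, n_init=10).fit_predict(X[i2])
        # restrict both labelings to the points the subsamples share
        common, p1, p2 = np.intersect1d(i1, i2, return_indices=True)
        scores.append(pairwise_jaccard(l1[p1], l2[p2]))
    return np.array(scores)   # the distribution of these scores characterizes stability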
Estivill-Castro and Yang [9] have offered a method in which Support Vector Machines are used to evaluate the separation of the clustering results. By filtering noise and outliers, this method can identify robust and potentially meaningful clustering results.
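One plausible way to realize such an SVM-based separation check, sketched below, is to treat the cluster labels as class labels and take cross-validated SVM accuracy as a separation score; this is an assumption made for illustration, not the exact procedure of [9].

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def svm_separation(X, labels, cv=5):
    """Treat cluster labels as class labels; clusters that an SVM can
    discriminate with high cross-validated accuracy are well separated."""
    return float(np.mean(cross_val_score(SVC(kernel='rbf'), X, labels, cv=cv)))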
Moller and Radke [21] have introduced an approach to validating clustering results based on partition stability. This method uses a perturbation produced by adding some noise to the data. An empirical study indicates that perturbation usually outperforms bootstrapping and subsampling: whereas the empirical choice of the subsampling size is often difficult [7], the choice of the perturbation strength is not so crucial. The method uses a Nearest Neighbor Resampling (NNR) approach that addresses both problems, namely information loss and empirical control of the degree of change made to the original data. NNR techniques were first used for time series analysis [4].
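A minimal sketch of one nearest-neighbor-style perturbation, assuming each point is replaced by a randomly chosen one among its k nearest neighbors; the neighbor count k and the helper name are illustrative, not taken from [21]. Larger k yields a stronger perturbation, which is how the change degree stays under empirical control.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def nnr_perturb(X, k=5, seed=0):
    """Perturb a data set by replacing each point with one of its
    k nearest neighbors chosen at random; k controls the strength
    of the change made to the original data."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X)
    pick = rng.integers(1, k + 1, size=len(X))        # skip column 0 (the point itself)
    return X[idx[np.arange(len(X)), pick]]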
Inokuchi et al. [17] have proposed kernelized validity measures, where the kernel is the kernel function used in support vector machines. Two measures are considered: one is the sum of the traces of the fuzzy covariances within clusters, and the other is a kernelized Xie-Beni measure [26]. These validity measures are applied to determine the number of clusters and to evaluate the robustness of different partitionings.
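For reference, the classical (non-kernelized) Xie-Beni index underlying this measure can be sketched in a few lines of Python; the kernelized variant in [17] in effect replaces the Euclidean distances below with kernel-induced ones, and the signature here is purely illustrative.

import numpy as np

def xie_beni(X, centers, U, m=2.0):
    """Classical Xie-Beni index for a fuzzy partition: total fuzzy
    within-cluster compactness divided by n times the minimum squared
    distance between centers (lower values indicate better partitions).
    U has shape (n_points, n_clusters); row j holds memberships of point j."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # (n, c)
    compactness = np.sum((U ** m) * d2)
    sep = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sep, np.inf)                                    # ignore i == k
    return compactness / (len(X) * sep.min())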
Das and Sil [6] have proposed a method to determine the number of clusters that validates the clusters using a splitting-and-merging technique in order to obtain an optimal set of clusters.
Fern and Lin [11] have suggested a clustering ensemble approach that selects a subset of solutions to form a smaller but better-performing cluster ensemble than one using all the primary solutions. The ensemble selection method is designed around quality and diversity, the two factors that have been shown to influence cluster ensemble performance, and it attempts to select a subset of primary partitions that simultaneously has the highest quality and diversity. The Sum of Normalized Mutual Information (SNMI) [25], [12], [13] is used to measure the quality of an individual partition with respect to the others, and the Normalized Mutual Information (NMI) is employed to measure the diversity between partitions. Although the ensemble size in their method is relatively small, it can achieve a significant performance improvement over full ensembles.
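The sketch below illustrates one greedy way to trade SNMI-based quality against NMI-based redundancy when picking a sub-ensemble; the scoring rule and the trade-off weight lam are illustrative assumptions, not Fern and Lin's exact selection algorithm.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def select_subensemble(partitions, size, lam=1.0):
    """Greedily pick `size` partitions: start from the highest-quality one
    (quality = mean NMI against all partitions, a normalized SNMI) and
    then penalize candidates that are redundant with those already chosen."""
    quality = [np.mean([nmi(p, o) for o in partitions]) for p in partitions]
    chosen = [int(np.argmax(quality))]
    while len(chosen) < size:
        scores = {}
        for i in range(len(partitions)):
            if i in chosen:
                continue
            redundancy = np.mean([nmi(partitions[i], partitions[j]) for j in chosen])
            scores[i] = quality[i] - lam * redundancy   # quality vs. diversity
        chosen.append(max(scores, key=scores.get))
    return chosen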
Law et al. have proposed a multi-objective data clustering method based on the selection of individual clusters produced by several clustering algorithms through an optimization procedure [18]. This technique chooses the best set of objective functions for different parts of the feature space from the results of the base clustering algorithms. Fred and Jain [14] have offered a new clustering ensemble method that learns the pairwise similarity between points in order to facilitate a proper partitioning of the data without a priori knowledge of the number of clusters or of their shapes.
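A compact sketch of the evidence-accumulation idea that this kind of method rests on: the learned pairwise similarity is the fraction of ensemble partitions in which two points co-occur, and a hierarchical clustering of that matrix yields a consensus partition. The average-link choice and the cut threshold below are illustrative assumptions, not the specific criterion of [14].

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation(partitions):
    """Learned pairwise similarity: fraction of ensemble partitions
    in which two points land in the same cluster."""
    P = np.asarray(partitions)                 # (n_partitions, n_points)
    n = P.shape[1]
    C = np.zeros((n, n))
    for p in P:
        C += (p[:, None] == p[None, :])
    return C / len(P)

def consensus_partition(partitions, cut=0.5):
    """Average-link hierarchical clustering of co-association distances;
    the fixed cut threshold stands in for a principled dendrogram cut."""
    D = 1.0 - coassociation(partitions)
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=cut, criterion='distance')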