parameters. There are several cluster validation methods based on the stability concept [19]. Ben-Hur et al. [3] have proposed a technique that exploits stability measurements of the clustering solutions obtained by perturbing a data set. In their approach, stability is characterized by the distribution of pairwise similarities between clusterings obtained from subsamples of the data. First, a co-association matrix is acquired using the resampling method; then, the Jaccard coefficient is extracted from this matrix as the stability measure.
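As a rough illustration of this idea, the Python sketch below clusters pairs of random subsamples with k-means and scores the agreement of the two solutions on the shared points with a pairwise Jaccard coefficient; the base clusterer, the subsample fraction, and all function names are illustrative choices, not details of [3].

import numpy as np
from sklearn.cluster import KMeans

def pairwise_jaccard(a, b):
    """Jaccard coefficient between two labelings of the same points:
    pairs co-clustered in both / pairs co-clustered in at least one."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)            # distinct pairs only
    both = np.sum(same_a[iu] & same_b[iu])
    either = np.sum(same_a[iu] | same_b[iu])
    return both / either if either else 1.0

def subsampling_stability(X, k, n_pairs=20, frac=0.8, seed=0):
    """Cluster pairs of random subsamples and record how well the two
    solutions agree on the points that the subsamples share."""
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for _ in range(n_pairs):
        i1 = rng.choice(n, int(frac * n), replace=False)
        i2 = rng.choice(n, int(frac * n), replace=False)
        l1 = KMeans(n_clusters=k, n_init=10).fit_predict(X[i1])
        l2 = KMeans(n_clusters=k, n_init=10).fit_predict(X[i2])
        # restrict both labelings to the points the subsamples share
        common, p1, p2 = np.intersect1d(i1, i2, return_indices=True)
        scores.append(pairwise_jaccard(l1[p1], l2[p2]))
    return np.array(scores)   # the distribution of these scores characterizes stability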
Estivill-Castro and Yang [9] have offered a method in which Support Vector Machines are used to evaluate the separation of the clustering results. By filtering noise and outliers, this method can identify robust and potentially meaningful clustering results.
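One plausible way to realize such an SVM-based separation check, sketched below, is to treat the cluster labels as class labels and take cross-validated SVM accuracy as a separation score; this is an assumption made for illustration, not the exact procedure of [9].

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def svm_separation(X, labels, cv=5):
    """Treat cluster labels as class labels; clusters that an SVM can
    discriminate with high cross-validated accuracy are well separated."""
    return float(np.mean(cross_val_score(SVC(kernel='rbf'), X, labels, cv=cv)))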
Moller and Radke [21] have introduced an approach to validating clustering results based on partition stability. This method uses a perturbation produced by adding some noise to the data. An empirical study indicates that perturbation usually outperforms bootstrapping and subsampling: whereas the empirical choice of the subsampling size is often difficult [7], the choice of the perturbation strength is not so crucial. The method uses a Nearest Neighbor Resampling (NNR) approach that addresses both problems, namely information loss and empirical control of the degree of change made to the original data. NNR techniques were first used for time series analysis [4].
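A minimal sketch of one nearest-neighbor-style perturbation, assuming each point is replaced by a randomly chosen one among its k nearest neighbors; the neighbor count k and the helper name are illustrative, not taken from [21]. Larger k yields a stronger perturbation, which is how the change degree stays under empirical control.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def nnr_perturb(X, k=5, seed=0):
    """Perturb a data set by replacing each point with one of its
    k nearest neighbors chosen at random; k controls the strength
    of the change made to the original data."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X)
    pick = rng.integers(1, k + 1, size=len(X))        # skip column 0 (the point itself)
    return X[idx[np.arange(len(X)), pick]]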
Inokuchi et al. [17] have proposed kernelized validity measures, where the kernel is the kernel function used in support vector machines. Two measures are considered: one is the sum of the traces of the fuzzy covariances within clusters, and the other is a kernelized Xie-Beni measure [26]. These validity measures are applied to determine the number of clusters and to evaluate the robustness of different partitionings.
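For reference, the classical (non-kernelized) Xie-Beni index underlying this measure can be sketched in a few lines of Python; the kernelized variant in [17] in effect replaces the Euclidean distances below with kernel-induced ones, and the signature here is purely illustrative.

import numpy as np

def xie_beni(X, centers, U, m=2.0):
    """Classical Xie-Beni index for a fuzzy partition: total fuzzy
    within-cluster compactness divided by n times the minimum squared
    distance between centers (lower values indicate better partitions).
    U has shape (n_points, n_clusters); row j holds memberships of point j."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # (n, c)
    compactness = np.sum((U ** m) * d2)
    sep = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sep, np.inf)                                    # ignore i == k
    return compactness / (len(X) * sep.min())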
Das and Sil [6] have proposed a method to determine the number of clusters that validates the clusters using a splitting-and-merging technique in order to obtain an optimal set of clusters.
Fern and Lin [11] have suggested a clustering ensemble approach that selects a subset of solutions to form a smaller but better-performing cluster ensemble than one using all the primary solutions. The ensemble selection method is designed around quality and diversity, the two factors that have been shown to influence cluster ensemble performance, and it attempts to select a subset of primary partitions that simultaneously has the highest quality and diversity. The Sum of Normalized Mutual Information (SNMI) [25], [12], [13] is used to measure the quality of an individual partition with respect to the others, and the Normalized Mutual Information (NMI) is employed to measure the diversity between partitions. Although the ensemble size in their method is relatively small, it can achieve a significant performance improvement over full ensembles.
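The sketch below illustrates one greedy way to trade SNMI-based quality against NMI-based redundancy when picking a sub-ensemble; the scoring rule and the trade-off weight lam are illustrative assumptions, not Fern and Lin's exact selection algorithm.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def select_subensemble(partitions, size, lam=1.0):
    """Greedily pick `size` partitions: start from the highest-quality one
    (quality = mean NMI against all partitions, a normalized SNMI) and
    then penalize candidates that are redundant with those already chosen."""
    quality = [np.mean([nmi(p, o) for o in partitions]) for p in partitions]
    chosen = [int(np.argmax(quality))]
    while len(chosen) < size:
        scores = {}
        for i in range(len(partitions)):
            if i in chosen:
                continue
            redundancy = np.mean([nmi(partitions[i], partitions[j]) for j in chosen])
            scores[i] = quality[i] - lam * redundancy   # quality vs. diversity
        chosen.append(max(scores, key=scores.get))
    return chosen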
Law et al. have proposed a multi-objective data clustering method based on the selection of individual clusters produced by several clustering algorithms through an optimization procedure [18]. This technique chooses the best set of objective functions for different parts of the feature space from the results of the base clustering algorithms. Fred and Jain [14] have offered a new clustering ensemble method that learns the pairwise similarity between points in order to facilitate a proper partitioning of the data without a priori knowledge of the number of clusters or of their shapes.
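A compact sketch of the evidence-accumulation idea that this kind of method rests on: the learned pairwise similarity is the fraction of ensemble partitions in which two points co-occur, and a hierarchical clustering of that matrix yields a consensus partition. The average-link choice and the cut threshold below are illustrative assumptions, not the specific criterion of [14].

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation(partitions):
    """Learned pairwise similarity: fraction of ensemble partitions
    in which two points land in the same cluster."""
    P = np.asarray(partitions)                 # (n_partitions, n_points)
    n = P.shape[1]
    C = np.zeros((n, n))
    for p in P:
        C += (p[:, None] == p[None, :])
    return C / len(P)

def consensus_partition(partitions, cut=0.5):
    """Average-link hierarchical clustering of co-association distances;
    the fixed cut threshold stands in for a principled dendrogram cut."""
    D = 1.0 - coassociation(partitions)
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=cut, criterion='distance')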