Generally, a clustering ensemble involves two main steps: (a) the creation of a
number of weak partitionings, and (b) the aggregation of the obtained primary
partitionings. Because every primary partitioning reveals a hidden aspect of the
data, their ensemble can compensate for their individual drawbacks. The primary
results therefore need to be as diverse as possible, so that they give more
information about the underlying patterns in the data. Many methods have been
suggested to create the necessary diversity among the primary results. The simplest
is to use different clustering algorithms. Other methods include choosing different
initializations, different algorithm parameters, or subsets of features, mapping the
data to other feature spaces [1], and resampling the data [20]. In this paper,
resampling, different base algorithms, different initializations and different
parameters are used to provide the necessary diversity among the primary results.
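As an illustration of this first step, the following minimal sketch builds a set of diverse primary partitionings by combining three of the sources of diversity named above: resampling, different initializations and different parameters. It is not the procedure used in this paper. The tiny k-means routine, the subsampling without replacement, the 80% sample fraction and the names kmeans and primary_partitionings are all assumptions made for the example, and the use of different base algorithms is omitted for brevity.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Tiny k-means on tuples of floats; returns one label per point."""
    centers = rng.sample(points, k)  # random initialization
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def primary_partitionings(points, n_partitions=10, k_choices=(2, 3, 4),
                          sample_frac=0.8, seed=0):
    """Return a list of partitionings, each a dict: point index -> cluster label.

    Points left out of a subsample do not appear in that partitioning.
    """
    rng = random.Random(seed)
    n = len(points)
    ensemble = []
    for _ in range(n_partitions):
        # Resampling: draw a subsample without replacement (an assumption).
        size = max(2, int(sample_frac * n))
        idx = sorted(rng.sample(range(n), size))
        # Different parameter: the number of clusters varies per run.
        k = min(rng.choice(k_choices), len(idx))
        # Different initialization: kmeans draws fresh random centers.
        labels = kmeans([points[i] for i in idx], k, rng)
        ensemble.append(dict(zip(idx, labels)))
    return ensemble
```

Each partitioning is stored as a mapping from point index to label, so that an aggregation step can tell which points took part in which run.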
The second main step in a clustering ensemble is to combine the primary
partitionings obtained in the first step. The co-association matrix based aggregator
is one of the most common methods for combining the primary partitionings, and it is
employed in this paper too. Evidence Accumulation Clustering (EAC), first proposed by
Fred and Jain, maps the individual data partitions in a clustering ensemble into a
new between-patterns similarity measure, summarizing the inter-pattern structures
perceived from these clusterings. The final data partition is obtained by applying
the single-link method to this new similarity matrix [13].
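The aggregation step can be sketched as follows. This is a minimal illustration of a co-association matrix followed by a single-link cut, not the exact formulation of [13]. It assumes that each partitioning is a dict from point index to label, that each entry is normalized by the number of partitionings in which both points appear, and that the dendrogram is cut at a fixed similarity threshold. The names co_association and single_link_cut are hypothetical.

```python
def co_association(partitionings, n):
    """C[i][j] = fraction of partitionings containing both i and j
    that place them in the same cluster (0.0 if they never co-occur)."""
    together = [[0] * n for _ in range(n)]
    seen = [[0] * n for _ in range(n)]
    for part in partitionings:  # part: dict point index -> cluster label
        items = list(part.items())
        for i, li in items:
            for j, lj in items:
                seen[i][j] += 1
                if li == lj:
                    together[i][j] += 1
    return [
        [together[i][j] / seen[i][j] if seen[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

def single_link_cut(C, threshold):
    """Cut a single-link hierarchy at similarity `threshold`.

    This equals the connected components of the graph that links
    i and j whenever C[i][j] >= threshold, found here with union-find.
    """
    n = len(C)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if C[i][j] >= threshold:
                parent[find(i)] = find(j)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

With three partitionings of four points, for example {0: 0, 1: 0, 2: 1, 3: 1}, {0: 1, 1: 1, 2: 0, 3: 0} and {0: 0, 1: 0, 2: 0, 3: 1}, points 0 and 1 are always grouped together and points 2 and 3 are grouped in two of the three runs, so a cut at 0.5 gives the final partition [0, 0, 1, 1].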
In this paper a new clustering ensemble method is proposed which uses a subset of
the primary clusters. A new validity measure, called Improved Stability (IStability),
is suggested to evaluate cluster goodness. Each cluster that satisfies a threshold on
IStability is allowed to participate in constructing the co-association matrix. A new
method, named Extended Evidence Accumulation Clustering (EEAC), is proposed to
construct this matrix. Finally, a hierarchical method is applied to the obtained
matrix to extract the final partition.
2 Background
A clustering ensemble that is based on a subset of selected primary clusters or
partitions faces one main problem: how to evaluate the clusters or partitions.
Because data clustering is an unsupervised problem, its validation is the most
troublesome task. Baumgartner et al. [2] have presented a resampling based technique
to validate the results of exploratory fuzzy clustering analysis. Since the concept
of cluster stability was introduced as a means of assessing the validity of data
partitionings, it has been increasingly used in the literature [14]. This idea, which
is based on resampling, was initially described in [5] and later generalized in
different ways in [16]. Roth et al. [24] have proposed a resampling based technique
to validate a cluster. The basic element in their method, which complements the
earlier methods, is cluster stability. Stability measures the association between
the partitions obtained from two individual clustering algorithms. High values of
the stability measure mean that applying the clustering algorithm several times to a
data set is likely to yield consistent results [22]. Roth and Lange [23] have
presented a new algorithm for data clustering which is based on feature selection. In
their method the resampling based stability measure is used to set the algorithm