An Asymmetric Criterion for Cluster Validation - Developing Concepts in Applied Intelligence

Generally, a clustering ensemble involves two main steps: (a) the creation of several weak partitionings, and (b) the aggregation of the primary partitionings obtained. Because every primary partitioning reveals a hidden aspect of the data, their ensemble can compensate for their individual drawbacks. The primary results therefore need to be as diverse as possible, so that they give more information about the underlying patterns in the data. Many methods have been suggested for creating this diversity. The simplest is to use different clustering algorithms. Others include choosing different initializations, different algorithm parameters, or subsets of features, mapping the data to other feature spaces [1], and resampling the data [20]. In this paper, resampling, different base algorithms, different initializations and different parameters are used to provide the necessary diversity for the primary results.
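As a rough illustration of how such diversity can be generated, the sketch below builds a small ensemble by combining three of the sources named above: resampling, random initialization, and a varying parameter (the number of clusters). It is a minimal sketch under stated assumptions, not the procedure used in this paper: plain k-means stands in for the base algorithm, subsampling without replacement stands in for resampling, and the names `kmeans`, `make_ensemble`, `k_choices` and `frac`, as well as the toy data, are hypothetical.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on 2-D tuples; returns one label per point."""
    centers = rng.sample(points, k)  # random initialization (source of diversity)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return labels

def make_ensemble(points, n_partitions=10, k_choices=(2, 3, 4), frac=0.8, seed=0):
    """Return a list of partial partitions, each a dict {point index: label}."""
    rng = random.Random(seed)
    n = len(points)
    ensemble = []
    for _ in range(n_partitions):
        # Resampling (assumed here: subsample a fraction without replacement).
        size = max(max(k_choices), int(frac * n))
        idx = sorted(rng.sample(range(n), size))
        # Different parameter per run: the number of clusters k.
        k = rng.choice(k_choices)
        labels = kmeans([points[i] for i in idx], k, rng)
        ensemble.append(dict(zip(idx, labels)))
    return ensemble

# Toy data: three loose groups in the plane (illustrative only).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0),
          (9.0, 0.1), (9.2, 0.3), (8.9, 0.0), (9.1, 0.2)]
ensemble = make_ensemble(points)
```

Each primary partitioning covers only the sampled points, so two partitionings generally disagree both in which points they label and in how many clusters they use.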

The second main step in a clustering ensemble is to combine the primary partitionings obtained in the first step. The co-association matrix based aggregator, which is also employed in this paper, is one of the most common methods for doing so. Evidence Accumulation Clustering (EAC), first proposed by Fred and Jain, maps the individual data partitions in a clustering ensemble into a new between-patterns similarity measure that summarizes the inter-pattern structures perceived from these clusterings. The final data partition is obtained by applying the single-link method to this new similarity matrix [13].
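The co-association idea can be sketched in a few lines. In the example below, entry (i, j) of the matrix is the fraction of partitions that place points i and j in the same cluster, and the single-link step is carried out by cutting at a similarity threshold, which is equivalent to taking the connected components of the graph whose edges join pairs with similarity at or above that threshold. This is a minimal sketch: the function names, the four toy partitions and the cut level of 0.6 are assumptions made for illustration, and the rule for choosing the cut in [13] is not reproduced here.

```python
def co_association(partitions, n):
    """Fraction of partitions in which each pair of points shares a cluster."""
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]

def single_link_cut(sim, threshold):
    """Single-link clusters at a given similarity level, via union-find."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge every pair whose similarity reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    # Relabel the component roots as 0, 1, 2, ... in order of appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]

# Four toy partitions of six points (cluster labels are arbitrary per partition).
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 0],
    [2, 2, 2, 0, 0, 0],
]
sim = co_association(partitions, 6)
final = single_link_cut(sim, threshold=0.6)  # assumed cut level
print(final)  # [0, 0, 0, 1, 1, 1]
```

Because only co-membership is counted, the matrix does not depend on how each partition happens to number its clusters, which is what allows partitions with different numbers of clusters to be combined.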

In this paper, a new clustering ensemble method is proposed that uses a subset of the primary clusters. A new validity measure, called Improved Stability (IStability), is suggested to evaluate cluster goodness. Each cluster that satisfies a threshold on IStability is eligible to participate in constructing the co-association matrix. A new method named Extended Evidence Accumulation Clustering (EEAC) is proposed to construct this matrix. Finally, a hierarchical method is applied to the resulting matrix to extract the final partition.
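The selection step can be pictured as a simple filter followed by pair counting. The sketch below is only an illustration of that flow and makes several assumptions: the scores are placeholder numbers rather than computed IStability values, the threshold of 0.5 is arbitrary, and the matrix holds raw co-occurrence counts because the normalization that EEAC applies is not given in this section. The names `select_clusters` and `co_occurrence_counts` are hypothetical.

```python
def select_clusters(clusters, scores, threshold):
    """Keep only the clusters whose validity score reaches the threshold."""
    return [c for c, s in zip(clusters, scores) if s >= threshold]

def co_occurrence_counts(selected, n):
    """Count, for every pair of points, the selected clusters containing both."""
    counts = [[0] * n for _ in range(n)]
    for cluster in selected:
        for i in cluster:
            for j in cluster:
                counts[i][j] += 1
    return counts

# Primary clusters pooled from several partitionings, as sets of point indices.
clusters = [{0, 1, 2}, {3, 4, 5}, {0, 1}, {2, 3}, {4, 5}]
# Placeholder validity scores, one per cluster (not real IStability values).
scores = [0.9, 0.8, 0.7, 0.2, 0.95]

selected = select_clusters(clusters, scores, threshold=0.5)  # assumed threshold
counts = co_occurrence_counts(selected, n=6)
```

In this toy case the low-scoring cluster {2, 3} is dropped, so points 2 and 3 contribute no evidence of belonging together.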

2 Background

A clustering ensemble based on a subset of selected primary clusters or partitions faces one main problem: how to evaluate the clusters or partitions. Because data clustering is an unsupervised problem, validation is its most troublesome task. Baumgartner et al. [2] have presented a resampling-based technique to validate the results of exploratory fuzzy clustering analysis. Since the concept of cluster stability was introduced as a means to assess the validity of data partitionings, it has been increasingly used in the literature [14]. This idea, which is based on resampling, was initially described in [5] and later generalized in different ways in [16]. Roth et al. [24] have proposed a resampling-based technique to validate a cluster. The basic element of their method, which complements earlier methods, is cluster stability. Stability measures the association between the partitions obtained from two individual clustering algorithms. Large values of the stability measure mean that applying the clustering algorithm several times to a data set is likely to yield the same results [22]. Roth and Lange [23] have presented a new algorithm for data clustering that is based on feature selection. In their method the resampling-based stability measure is used to set the algorithm
