An Asymmetric Criterion for Cluster Validation - Developing Concepts in Applied Intelligence

Information Technology Reference

In-Depth Information

combine the primary results [25], [13] and [15]. Here, the difference is the absence of

some clusters in primary partitions. Since the original EAC method [13] cannot truly

identify the pairwise similarity while there is just a subset of clusters, in this paper a

new method to construct the co-association matrix is presented which hereafter called:

Extended Evidence Accumulation Clustering method, EEAC. Finally, the single-link

method is employed to extract the final clusters from this matrix.

3.1 Cluster Evaluation Using APMM Criteria

A stable cluster is the one that has a high likelihood of recurrence across multiple

applications of the clustering method. The stable clusters are usually preferable, since

they are robust with respect to minor changes in the dataset [18].

Now assume that we want to compute the stability of cluster C i . In this method first

a set of partitionings over resampled datasets is provided which is called the reference

set. In this notation D is resampled data and P(D) is a partitioning over D . Now, the

problem is: “How many times is the cluster C i repeated in the reference partitions?”

Denote by NMI( C i ,P(D) ), the Normalized Mutual Information between the cluster C i

and a reference partition P(D) . The most previous works only compare a partition

with another partition [25]. However, the stability used in [18] evaluates the

similarity between a cluster and a partition by transforming the cluster C i to a

partition and employing common partition to partition methods. To illustrate this

method let P 1 = P a ={ C i ,D/C i } be a partition with two clusters, where D/C i denotes the

set of data points in D that are not in C i .

Then we may compute a second partition P 2 = P b = {C * ,D/C * } , where C * denotes the

union of all “positive” clusters in P ( D ) and others are in D/C * . A cluster C j in P ( D ) is

positive if more than half of its data points are in C i .

Now, define NMI( C i ,P(D) ) by NMI( P a ,P b ) which is calculated as [12]:

⎛

⎞

∑∑

⎜ ⎜

⎝

⎟ ⎟

⎠

−

log

NMI

(

)

(1)

⎛

⎞

⎛

⎞

∑

⎜

⎝

⎟

⎠

⎜ ⎜

⎝

⎟ ⎟

⎠

log

⎜

⎟

where n is the total number of samples and

denotes the number of shared patterns

between clusters C a i ∈ P a and C a j ∈ P a .

This is done between the cluster C i and all partitions available in the reference set.

Fig. 2 shows this method.

NMI i in Fig. 2 shows the stability of cluster C i with respect to the i -th partition in

reference set. The total stability of cluster C i is defined as:

∑

Stability

(

)

NMI

(2)

where M is the number of partitions available in reference set. This procedure is

applied to each cluster of every primary partition.

Developing Concepts in Applied Intelligence

Search WWH ::

Custom Search

Home