the accuracy of the algorithm depends on the quality of the definition of the optimal projective cluster.
9.2.2. Distance-based Algorithms
Distance-based algorithms define a cluster as a partition such that the distance between objects within the same cluster is minimized and the distance between objects from different clusters is maximized. In contrast to density-based methods, in which each data point is assigned to every cluster with a different probability, distance-based methods assign each data point to a cluster with probability 0 or 1. A distance measure is defined between data points. The most commonly used measure is the Euclidean distance, which is not effective for projected clustering because data points in two different clusters may have different numbers of dimensions. The following two distance-based algorithms address this problem.
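As a minimal sketch of the hard (0/1) assignment used by distance-based methods, each point can be assigned to exactly one cluster, namely the one with the nearest representative. The names and data below are illustrative, not taken from any particular algorithm in this chapter.

```python
# Sketch of hard (0/1) cluster assignment: each point belongs to
# exactly one cluster, the one whose medoid is nearest.

def euclidean(p, q):
    """Euclidean distance between two points of equal dimensionality."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def assign_hard(points, medoids, dist):
    """Assign each point to the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda j: dist(p, medoids[j]))
            for p in points]

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0)]
medoids = [(0.0, 0.0), (5.0, 5.0)]
print(assign_hard(points, medoids, euclidean))  # -> [0, 0, 1]
```

Each point contributes to exactly one cluster, unlike the soft, probability-weighted assignment of density-based methods.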
PROCLUS (PROjected CLUStering) [1] allows the selection of different subsets of dimensions for different clusters. It requires two user parameters: the number of clusters, denoted by k, and the average cardinality of the subsets of the full dimensions, denoted by l. The Manhattan segmental distance is used in this method instead of the Euclidean distance, because it normalizes away the number of dimensions when comparing data points in two different clusters that have different numbers of dimensions. The Manhattan
segmental distance between p1 and p2 relative to dimension set D is defined as:

d(p1, p2) = ( ∑_{i∈D} |p1,i − p2,i| ) / |D|.

In Fig. 9.2, we show the Manhattan segmental distance by an example. The distance between x1 and x2 over the dimension set (x, y) is (a + b)/2.
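The Manhattan segmental distance above can be sketched directly from its definition: sum the absolute per-dimension differences over the dimensions in D, then divide by |D|. The variable names below are illustrative.

```python
# Sketch of the Manhattan segmental distance relative to a dimension
# set D: the mean of per-dimension absolute differences over D.

def manhattan_segmental(p1, p2, D):
    """Manhattan segmental distance between p1 and p2 over dimensions D."""
    return sum(abs(p1[i] - p2[i]) for i in D) / len(D)

# As in the (a + b)/2 example: the points differ by a = 3 in x and
# b = 4 in y; the third dimension is not in D and is ignored.
x1 = (0.0, 0.0, 7.0)
x2 = (3.0, 4.0, 100.0)
print(manhattan_segmental(x1, x2, D=(0, 1)))  # -> 3.5
```

Because the sum is divided by |D|, distances computed over dimension sets of different sizes remain comparable.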
PROCLUS is based on k-medoid techniques and has three phases: an initialization phase, an iterative phase, and a refinement phase. In the initialization phase, a random sample of data points of size A × k is chosen, and then a greedy algorithm is applied to find a superset of the medoids, denoted by M, of size B × k (A and B are constants, and A > B). In the iterative phase, the best k medoids are found
from M by the following steps. It first randomly finds a set of k medoids from
M and then finds an appropriate set of dimensions for each medoid in the set. It
then forms a cluster corresponding to each medoid. Bad medoids are identified by the following two rules: the medoid of the cluster with the least number of points is bad, and the medoid of any cluster with fewer than (N/k) × minDeviation points is bad, where minDeviation is a constant smaller than 1. The replacement logic is as follows: the current clustering is compared with the clustering obtained by replacing the bad medoids with random points from M. The new medoids are kept if the new clustering is better; otherwise, the bad medoids are replaced with other random points from M.
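The two bad-medoid rules can be sketched as a function over cluster sizes; the function name and the sample sizes are illustrative, not from the PROCLUS paper.

```python
# Sketch of the two bad-medoid rules: the medoid of the smallest
# cluster is bad, and so is the medoid of any cluster with fewer
# than (N / k) * min_deviation points.

def bad_medoids(cluster_sizes, min_deviation):
    """Return sorted indices of bad medoids given the cluster sizes."""
    n = sum(cluster_sizes)                  # N: total number of points
    k = len(cluster_sizes)                  # k: number of clusters
    threshold = (n / k) * min_deviation
    bad = {cluster_sizes.index(min(cluster_sizes))}    # rule 1: smallest
    bad |= {i for i, s in enumerate(cluster_sizes)     # rule 2: below
            if s < threshold}                          # the threshold
    return sorted(bad)

sizes = [40, 35, 5, 20]            # N = 100, k = 4, so N/k = 25
print(bad_medoids(sizes, 0.5))     # threshold 12.5 -> [2]
print(bad_medoids(sizes, 0.9))     # threshold 22.5 -> [2, 3]
```

Raising minDeviation tightens the threshold, so more small clusters have their medoids flagged for replacement.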
If the result does not change after a certain number of attempts, the