A Projected Clustering Algorithm and Its Biomedical Application - Clustering Challenges in Biological Network

Biology Reference

In-Depth Information

end for

return M

end

9.3.3. Iterative Phase

We begin by choosing a random set of k points from M . Then we iteratively

replace the bad medoids in the current best medoids set with random points from

M until the current best medoids set does not change after a certain number of

replacements have been tried.

In each iteration, we first find dimensions for each medoid in the set, and form

the cluster corresponding to each medoid. Then we evaluate the clustering and

replace the bad medoids in the current best medoids set if the new clustering is

better.

In order to find dimensions, we first define several notations. For each medoid

m i , δ i is the minimum distance from any other medoids to m i based on full di-

mensions. We define the locality L i to be the set of points within the distance of

δ i from m i and define X i,j to be the average distance to m i along dimension j .

We then calculate X i,j as follows. We divide the average distance from the points

in L i to m i along dimension j by the normalization factor n j . There are two con-

straints when associating dimensions to medoids. The total number of dimensions

associated to medoids must be equal to k

l since k is the number of clusters and

l is the average number of dimensions in a cluster. The number of dimensions

associated with each medoid must be at least 2, which makes each cluster mean-

ingful. For each medoid i , we compute the mean Y i = j =1 X i,j d ,and

the standard deviation σ i = j ( X i,j −

1) of the values X i,j . Y i

represents the average modified Manhattan segmental distance of the points in L i

relative to the entire space. Thus Z i,j = ( X i,j −

Y i ) 2 / ( d

−

Y i ) /σ i indicates how the average

distance along dimension j associated with the medoid m i is related to the aver-

age modified Manhattan segmental distance associated with the same medoid. We

decide dimensions for all clusters by picking the smallest Z i,j values by a greedy

algorithm [10] used in PROCLUS such that a total of k

l values are chosen and

at least 2 values are chosen for each medoid. More specifically, all the Z i,j values

are sorted in increasing order. We assign the two smallest for each i , which gives a

total of 2 k values, and then greedily pick the remaining lowest k× ( l− 2) values.

We do a single pass over the database to assign the points to the medoids.

For each i , we compute the modified Manhattan segmental distance relative to D i

between a point and the medoid m i , and assign the point to the closest medoid.

Search WWH ::

Custom Search

Home