the accuracy of the algorithm depends on the quality of the definition of the optimal projective cluster.
9.2.2. Distance-based Algorithms
Distance-based algorithms define a cluster as a partition such that the distance between objects within the same cluster is minimized and the distance between objects from different clusters is maximized. In contrast to density-based methods, in which each data point is assigned to every cluster with a different probability, distance-based methods assign each data point to a cluster with probability 0 or 1. A distance measure is defined between data points. The most commonly used measure is the Euclidean distance, which is not effective for projected clustering because data points in two different clusters may have different numbers of dimensions. The following two distance-based algorithms address this problem.
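As a minimal sketch of the hard (0/1) assignment used by distance-based methods, each point can be assigned to exactly one cluster, namely the one with the nearest representative. The names and data below are illustrative, not taken from any particular algorithm in this chapter.

```python
# Sketch of hard (0/1) cluster assignment: each point belongs to
# exactly one cluster, the one whose medoid is nearest.

def euclidean(p, q):
    """Euclidean distance between two points of equal dimensionality."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def assign_hard(points, medoids, dist):
    """Assign each point to the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda j: dist(p, medoids[j]))
            for p in points]

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0)]
medoids = [(0.0, 0.0), (5.0, 5.0)]
print(assign_hard(points, medoids, euclidean))  # -> [0, 0, 1]
```

Each point contributes to exactly one cluster, unlike the soft, probability-weighted assignment of density-based methods.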
PROCLUS (PROjected CLUStering) [1] allows the selection of different subsets of dimensions for different clusters. It requires two user parameters: the number of clusters, denoted by k, and the average cardinality of the subsets of the full dimensions, denoted by l. The Manhattan segmental distance is used in this method instead of the Euclidean distance, because it normalizes away the number of dimensions when comparing data points in two different clusters that have different numbers of dimensions. The Manhattan
segmental distance between p1 and p2 relative to dimension set D is defined as:

d(p1, p2) = ( ∑_{i∈D} |p1,i − p2,i| ) / |D|.

In Fig. 9.2, we show the Manhattan segmental distance by an example. The distance between x1 and x2 over the dimension set (x, y) is (a + b)/2.
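The Manhattan segmental distance above can be sketched directly from its definition: sum the absolute per-dimension differences over the dimensions in D, then divide by |D|. The variable names below are illustrative.

```python
# Sketch of the Manhattan segmental distance relative to a dimension
# set D: the mean of per-dimension absolute differences over D.

def manhattan_segmental(p1, p2, D):
    """Manhattan segmental distance between p1 and p2 over dimensions D."""
    return sum(abs(p1[i] - p2[i]) for i in D) / len(D)

# As in the (a + b)/2 example: the points differ by a = 3 in x and
# b = 4 in y; the third dimension is not in D and is ignored.
x1 = (0.0, 0.0, 7.0)
x2 = (3.0, 4.0, 100.0)
print(manhattan_segmental(x1, x2, D=(0, 1)))  # -> 3.5
```

Because the sum is divided by |D|, distances computed over dimension sets of different sizes remain comparable.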
PROCLUS is based on k-medoid techniques and has three phases: an initialization phase, an iterative phase, and a refinement phase. In the initialization phase, a random sample of data points of size A × k is chosen, and then a greedy algorithm is applied to find a superset of the medoids, denoted by M, of size B × k (A and B are constants, and A > B). In the iterative phase, the best k medoids are found
from M by the following steps. It first randomly finds a set of k medoids from
M and then finds an appropriate set of dimensions for each medoid in the set. It
then forms a cluster corresponding to each medoid. Bad medoids are identified by the following two rules: the medoid of the cluster with the least number of points is bad, and the medoid of any cluster with fewer than (N/k) × minDeviation points is bad, where minDeviation is a constant smaller than 1. The replacement logic is as follows: the current clustering is compared with the clustering obtained by replacing the bad medoids with random points from M. The new medoids are kept if the new clustering is better; otherwise, the bad medoids are replaced with other random points from M.
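The two bad-medoid rules can be sketched as a function over cluster sizes; the function name and the sample sizes are illustrative, not from the PROCLUS paper.

```python
# Sketch of the two bad-medoid rules: the medoid of the smallest
# cluster is bad, and so is the medoid of any cluster with fewer
# than (N / k) * min_deviation points.

def bad_medoids(cluster_sizes, min_deviation):
    """Return sorted indices of bad medoids given the cluster sizes."""
    n = sum(cluster_sizes)                  # N: total number of points
    k = len(cluster_sizes)                  # k: number of clusters
    threshold = (n / k) * min_deviation
    bad = {cluster_sizes.index(min(cluster_sizes))}    # rule 1: smallest
    bad |= {i for i, s in enumerate(cluster_sizes)     # rule 2: below
            if s < threshold}                          # the threshold
    return sorted(bad)

sizes = [40, 35, 5, 20]            # N = 100, k = 4, so N/k = 25
print(bad_medoids(sizes, 0.5))     # threshold 12.5 -> [2]
print(bad_medoids(sizes, 0.9))     # threshold 22.5 -> [2, 3]
```

Raising minDeviation tightens the threshold, so more small clusters have their medoids flagged for replacement.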
If the result does not change after a certain number of attempts, the