comes to clustering, the distributions of points along different dimensions also matter. For example, suppose the distance between points $x_1$ and $x_2$ is 20 in dimension $i$ and 200 in dimension $j$; this does not necessarily mean that $x_1$ and $x_2$ are closer in dimension $i$ than in dimension $j$, since the data may follow different distributions along dimensions $i$ and $j$. Therefore, the distance along each dimension needs to be normalized. We want to find a normalization factor $n_i$ for each dimension $i$, and the modified Manhattan segmental distance between $x_1$ and $x_2$ relative to a dimension set $D$ can then be defined as:

$$d_D(x_1, x_2) = \frac{\sum_{i \in D} |p_{1,i} - p_{2,i}| / n_i}{|D|}$$

In our algorithm, we use the standard deviation of all points in the dataset along dimension $i$ as the normalization factor $n_i$.
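As a concrete illustration, the normalized distance above might be computed as in the following Python sketch; the function and variable names (segmental_distance, data, dims) and the tiny demo dataset are our own assumptions, not part of the text:

import numpy as np

def segmental_distance(x1, x2, dims, n):
    """Average normalized Manhattan distance over the dimension set dims.

    n[i] is the normalization factor for dimension i (here, the standard
    deviation of the dataset along dimension i)."""
    return sum(abs(x1[i] - x2[i]) / n[i] for i in dims) / len(dims)

# Normalization factors: per-dimension standard deviation of a demo dataset.
data = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]])
n = data.std(axis=0)
d = segmental_distance(data[0], data[1], dims=[0, 1], n=n)
print(d)

Note how the division by $n_i$ puts the two dimensions on a comparable scale: without it, dimension $j$ (with the larger spread) would dominate the sum.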
9.3.2. Initialization Phase
In the initialization phase, data points are first chosen at random to form a sample set $S$ of size $A \times k$, where $A$ is a constant. A greedy algorithm is then applied to $S$ to obtain an even smaller set of points $M$ of size $B \times k$, where $B$ is a small constant. The greedy algorithm is based on the idea of avoiding medoids drawn from the same cluster; it therefore chooses the set of points that are farthest apart. This greedy algorithm was proposed in [8] and is illustrated in Algorithm 9.2.
Algorithm 9.2. Greedy(Set of points: S, Number of medoids: k)
{ d(.,.) is the distance function }
begin
  M = { m_1 }   { m_1 is a random point of S }
  { compute the distance between each point and medoid m_1 }
  for each x ∈ S \ M do
    dist(x) = d(x, m_1)
  end for
  for i = 2 to k do
    { choose medoid m_i to be far from the previous medoids }
    let m_i ∈ S \ M be s.t. dist(m_i) = max { dist(x) | x ∈ S \ M }
    M = M ∪ { m_i }
    { compute the distance of each point to its closest medoid }
    for each x ∈ S \ M do
      dist(x) = min { dist(x), d(x, m_i) }
    end for
  end for
  return M
end
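To make the pseudocode concrete, here is a minimal Python sketch of the greedy selection in Algorithm 9.2, together with the sampling step of the initialization phase. The distance function (Manhattan here), the demo values of A, B, and k, and the random demo data are our assumptions, not part of the text:

import random
import numpy as np

def d(x, y):
    """Assumed distance function (Manhattan); any metric could be substituted."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y))))

def greedy(S, k):
    """Farthest-point selection as in Algorithm 9.2: each new medoid
    maximizes its distance to the closest previously chosen medoid."""
    S = [tuple(p) for p in S]          # make points hashable
    m1 = random.choice(S)              # m_1 is a random point of S
    M = [m1]
    # distance of each remaining point to its (so far only) medoid
    dist = {x: d(x, m1) for x in S if x != m1}
    for _ in range(2, k + 1):
        # choose medoid m_i to be far from the previous medoids
        mi = max(dist, key=dist.get)
        M.append(mi)
        del dist[mi]
        # update each point's distance to its closest medoid
        for x in dist:
            dist[x] = min(dist[x], d(x, mi))
    return M

# Initialization phase: draw a random sample S of size A*k, then reduce it
# to B*k candidate medoids with the greedy routine.
k, A, B = 3, 10, 2
points = np.random.rand(200, 4).tolist()
S = random.sample(points, A * k)
M = greedy(S, B * k)

Maintaining dist(x) incrementally, rather than recomputing the closest medoid of every point in each iteration, is what keeps the selection efficient: each of the k rounds costs one pass over the sample.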