comes to clustering, the distributions of points along different dimensions also matter. For example, suppose the distance between points x_1 and x_2 is 20 in dimension i and 200 in dimension j; that does not necessarily mean that x_1 and x_2 are closer in dimension i than in dimension j, since the data may have different distributions along dimensions i and j. Therefore, there is a need to normalize the distance along each dimension. We want to find a normalization factor n_i for each dimension, and the modified Manhattan segmental distance between x_1 and x_2 relative to dimension set D can be defined as (∑_{i∈D} |p_{1,i} − p_{2,i}| / n_i) / |D|. In our algorithm, we use the standard deviation of all points in the dataset along dimension i as the normalization factor n_i.
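As an illustration, a minimal Python sketch of this normalized Manhattan segmental distance might look as follows; the function name, the argument layout, and the use of NumPy are our own assumptions rather than part of the original algorithm:

import numpy as np

def manhattan_segmental_distance(x1, x2, dims, norm=None):
    """Normalized Manhattan segmental distance between two points,
    restricted to the dimension set `dims`.

    norm[i] plays the role of the normalization factor n_i for dimension i,
    e.g. the standard deviation of the whole dataset along that dimension.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    if norm is None:
        norm = np.ones_like(x1)                 # no normalization
    norm = np.asarray(norm, dtype=float)
    diffs = np.abs(x1[dims] - x2[dims]) / norm[dims]
    return diffs.sum() / len(dims)

# Usage (hypothetical data): per-dimension standard deviations as n_i
# data = np.random.rand(100, 5)
# n = data.std(axis=0)
# d = manhattan_segmental_distance(data[0], data[1], dims=[0, 2], norm=n)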
9.3.2. Initialization Phase
In the initialization phase, a random sample S of size A × k is first drawn from the data points, where A is a constant. Then a greedy algorithm is applied to S to obtain an even smaller set of points M of size B × k, where B is a small constant. The greedy algorithm is based on the idea that we want to avoid choosing medoids from the same cluster; therefore, we choose a set of points that are as far apart as possible. The greedy algorithm was proposed in [8] and is illustrated in Algorithm 9.2.
Algorithm 9.2. Greedy(Set of points: S, Number of medoids: k)
{d(.,.) is the distance function}
begin
   M = {m_1}  {m_1 is a random point of S}
   {compute distance between each point and medoid m_1}
   for each x ∈ S \ M do
      dist(x) = d(x, m_1)
   end for
   for i = 2 to k do
      {choose medoid m_i to be far from previous medoids}
      let m_i ∈ S \ M be s.t. dist(m_i) = max{dist(x) | x ∈ S \ M}
      M = M ∪ {m_i}
      {compute distance of each point to its closest medoid}
      for each x ∈ S \ M do
         dist(x) = min{dist(x), d(x, m_i)}
      end for
   end for
   return M
end
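For reference, a small Python sketch of this greedy selection, together with the sampling step of the initialization phase, might look as follows; the function name `greedy`, the NumPy-based implementation, and the commented sampling lines are our own illustrative choices, not code taken from [8]:

import numpy as np

def greedy(points, k, dist, rng=None):
    """Greedy selection of k well-separated medoid candidates
    (in the spirit of Algorithm 9.2). Returns indices into `points`."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    medoids = [int(rng.integers(n))]              # m_1: a random point of S
    # dist(x): distance of every point to its closest chosen medoid so far
    d_min = np.array([dist(points[i], points[medoids[0]]) for i in range(n)])
    for _ in range(1, k):
        d_min[medoids] = -np.inf                  # never re-pick a chosen medoid
        m_i = int(np.argmax(d_min))               # farthest from all chosen medoids
        medoids.append(m_i)
        d_new = np.array([dist(points[i], points[m_i]) for i in range(n)])
        d_min = np.minimum(d_min, d_new)          # update closest-medoid distances
    return medoids

# Initialization phase (A, B are the constants from the text; values are hypothetical):
# rng = np.random.default_rng()
# S = points[rng.choice(len(points), size=A * k, replace=False)]
# M = greedy(S, B * k, dist=lambda a, b: np.abs(a - b).sum())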