that assigning a point to a cluster according to the minimal Mahalanobis distance
with the cluster center is equivalent to assigning it to a cluster according to the
maximum likelihood value, as long as the distributions of clusters have similar
general variances.
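The equivalence can be made explicit under one assumption that this excerpt does not state: that each cluster $k$ follows a multivariate Gaussian distribution. The symbols $\boldsymbol{\mu}_k$ (cluster center), $\boldsymbol{\Sigma}_k$ (cluster covariance) and $p$ (dimension) are introduced here only for illustration, and "general variance" is read as the determinant $|\boldsymbol{\Sigma}_k|$. Under those assumptions, the log-likelihood of a point $\mathbf{x}$ can be sketched as:

```latex
\log p(\mathbf{x} \mid k)
  = -\tfrac{1}{2}\, d_M^2(\mathbf{x}, \boldsymbol{\mu}_k)
    - \tfrac{1}{2} \log |\boldsymbol{\Sigma}_k|
    - \tfrac{p}{2} \log (2\pi),
\qquad
d_M^2(\mathbf{x}, \boldsymbol{\mu}_k)
  = (\mathbf{x} - \boldsymbol{\mu}_k)^{T} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)
```

If $|\boldsymbol{\Sigma}_k|$ is roughly the same for every cluster, the last two terms barely change with $k$, so choosing the cluster with the largest likelihood amounts to choosing the one with the smallest Mahalanobis distance.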
5.2.1.3. Cosine
Cosine is widely used as a similarity measure in text clustering [27]. It is
defined as

$$s_C(\mathbf{x}_i, \mathbf{x}_j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\|\mathbf{x}_i\| \, \|\mathbf{x}_j\|} \qquad (5.6)$$

where $\mathbf{x}_i \cdot \mathbf{x}_j = \mathbf{x}_i^{T} \mathbf{x}_j$ is the inner product of the two vectors and
$\|\cdot\|$ is the Euclidean norm.

In text clustering, texts are usually coded according to the presence (code 1)
or absence (code 0) of the words or sentences of interest. For instance, suppose
we are interested in five words (features) A, B, C, D and E. Two texts are coded
as $\mathbf{x}_1 = [1, 0, 0, 0, 0]$ and $\mathbf{x}_2 = [0, 0, 0, 0, 1]$, which means that in $\mathbf{x}_1$ only
word A is present, and in $\mathbf{x}_2$ only word E is present. If we use Euclidean
distance to measure their dissimilarity, $d_2(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{2}$.

Now consider another two texts, $\mathbf{x}_3 = [1, 1, 1, 1, 0]$ and $\mathbf{x}_4 = [0, 1, 1, 1, 1]$.
Their Euclidean distance is also $d_2(\mathbf{x}_3, \mathbf{x}_4) = \sqrt{2}$. Clearly, texts $\mathbf{x}_1$ and
$\mathbf{x}_2$ have no word in common, whereas $\mathbf{x}_3$ and $\mathbf{x}_4$ have 3 out of 5 words in
common. Texts $\mathbf{x}_3$ and $\mathbf{x}_4$ should therefore have lower dissimilarity than $\mathbf{x}_1$
and $\mathbf{x}_2$. However, Euclidean distance rates the two pairs as equally dissimilar.
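Cosine solves this problem. The cosine of texts $\mathbf{x}_1$ and $\mathbf{x}_2$ is
$s_C(\mathbf{x}_1, \mathbf{x}_2) = 0$, and that of texts $\mathbf{x}_3$ and $\mathbf{x}_4$ is $s_C(\mathbf{x}_3, \mathbf{x}_4) = 3/4$.
It means that texts $\mathbf{x}_3$ and $\mathbf{x}_4$ have higher similarity than $\mathbf{x}_1$ and $\mathbf{x}_2$.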
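The comparison between the two measures can be checked numerically. The following is a minimal sketch in plain Python; the helper names `euclidean` and `cosine` are chosen here for illustration and are not part of the original text.

```python
import math

def euclidean(a, b):
    """Euclidean distance d_2 between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def cosine(a, b):
    """Cosine similarity s_C: inner product divided by the product of norms.

    Assumes neither vector is all zeros (otherwise the denominator is 0).
    """
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

# Presence/absence coding of words A, B, C, D, E for the four example texts.
x1 = [1, 0, 0, 0, 0]
x2 = [0, 0, 0, 0, 1]
x3 = [1, 1, 1, 1, 0]
x4 = [0, 1, 1, 1, 1]

# Euclidean distance cannot tell the two pairs apart: both are sqrt(2).
print(euclidean(x1, x2), euclidean(x3, x4))

# Cosine separates them: 0 for (x1, x2) and 3/4 for (x3, x4).
print(cosine(x1, x2), cosine(x3, x4))
```

The inner product counts the shared words (0 and 3), and dividing by the norms accounts for how many words each text contains, which is why the second pair scores higher.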
5.2.2. Measures for Variable Clustering
Variable clustering is important for identifying dependencies among variables,
for causal analysis, and for selecting variables to reduce the dimension of the
data. For instance, in the stock market, it is important to understand which
stocks are interdependent, the cause-and-effect relationships among these
interdependent stocks, and which stocks affect the stocks of interest. In
neuroscience, in order to understand from neural activity data how neurons
cooperate with each other, one can cluster the neurons by calculating similarity
(or dissimilarity) measures among the spike train data (sequences) of neurons
in vivo.
In this subsection, we introduce two commonly used association measures:
Pearson's correlation coefficient and mutual information.