that assigning a point to a cluster according to the minimal Mahalanobis distance
with the cluster center is equivalent to assigning it to a cluster according to the
maximum likelihood value, as long as the distributions of clusters have similar
general variances.
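The equivalence stated above can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' method: it assumes each cluster is Gaussian, and it reads "general variances" as the determinant of the cluster covariance matrix. Under those assumptions the log-likelihood equals minus half the squared Mahalanobis distance, minus half the log-determinant, minus a constant, so equal determinants make the two assignment rules agree. The cluster parameters and all function names are hypothetical.

```python
import math

def inv_and_det_2x2(cov):
    """Return the inverse and the determinant of a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return inv, det

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance between point x and cluster center mu."""
    inv, _ = inv_and_det_2x2(cov)
    dx = [x[0] - mu[0], x[1] - mu[1]]
    return sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))

def gaussian_log_likelihood(x, mu, cov):
    """Log-density of a 2-D Gaussian (assumed cluster model) at x."""
    _, det = inv_and_det_2x2(cov)
    return (-0.5 * mahalanobis_sq(x, mu, cov)
            - 0.5 * math.log(det)
            - math.log(2 * math.pi))

# Hypothetical clusters: different shapes, but the same determinant (4.0).
clusters = [
    ((0.0, 0.0), [[4.0, 0.0], [0.0, 1.0]]),
    ((3.0, 3.0), [[1.0, 0.0], [0.0, 4.0]]),
]

def assign_by_distance(x):
    """Index of the cluster with the minimal Mahalanobis distance."""
    return min(range(len(clusters)),
               key=lambda k: mahalanobis_sq(x, *clusters[k]))

def assign_by_likelihood(x):
    """Index of the cluster with the maximal Gaussian log-likelihood."""
    return max(range(len(clusters)),
               key=lambda k: gaussian_log_likelihood(x, *clusters[k]))

points = [(3.0, 1.0), (0.5, 2.0), (4.0, 0.0)]
for p in points:
    print(p, assign_by_distance(p), assign_by_likelihood(p))
```

If the determinants differ substantially, the log-determinant term no longer cancels and the two rules can disagree for points near the boundary between clusters.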
5.2.1.3. Cosine
Cosine is widely used as a similarity measure in text clustering [27]. It is defined as

$$ s_C(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|} \tag{5.6} $$

where $x_i \cdot x_j = x_i^T x_j$ is the inner product of the two vectors. In text
clustering, texts are usually coded according to the presence (code 1) or absence
(code 0) of the words or sentences of interest. For instance, suppose we are
interested in five words (features) A, B, C, D and E. Two texts are coded as
$x_1 = [1, 0, 0, 0, 0]$ and $x_2 = [0, 0, 0, 0, 1]$, which means that in $x_1$ only
word A is present, and in $x_2$ only word E is present. If we use Euclidean
distance to measure their dissimilarity, $d_2(x_1, x_2) = \sqrt{2}$. Now consider
another two texts, $x_3 = [1, 1, 1, 1, 0]$ and $x_4 = [0, 1, 1, 1, 1]$. Their
Euclidean distance is also $d_2(x_3, x_4) = \sqrt{2}$. Clearly, texts $x_1$ and $x_2$
have no word in common, but $x_3$ and $x_4$ have 3 out of 5 words in common. Texts
$x_3$ and $x_4$ should therefore have lower dissimilarity than $x_1$ and $x_2$.
However, Euclidean distance rates the two pairs as equally dissimilar.
Cosine solves this problem. The cosine of texts $x_1$ and $x_2$ is $s_C(x_1, x_2) = 0$,
and that of texts $x_3$ and $x_4$ is $s_C(x_3, x_4) = 3/4$. This means that texts
$x_3$ and $x_4$ have higher similarity than $x_1$ and $x_2$.
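The comparison in this example can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not taken from the text, and the cosine function assumes that neither vector is all zeros.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Presence/absence coding of the five words A, B, C, D, E.
x1 = [1, 0, 0, 0, 0]
x2 = [0, 0, 0, 0, 1]
x3 = [1, 1, 1, 1, 0]
x4 = [0, 1, 1, 1, 1]

# Euclidean distance cannot tell the two pairs apart.
print(euclidean(x1, x2), euclidean(x3, x4))

# Cosine similarity separates them: no shared words vs. three shared words.
print(cosine(x1, x2), cosine(x3, x4))
```

Both Euclidean distances equal the square root of 2, while the cosine values are 0 for the first pair and 3/4 for the second, matching the figures in the text.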
5.2.2. Measures for Variable Clustering
Variable clustering is very important for identifying dependencies among
variables, for causal analysis, and for selecting variables to reduce the
dimension of the data. For instance, in the stock market, it is important to
understand which stocks are interdependent, the cause-and-effect relationships
among these interdependent stocks, and which stocks affect the stocks of
interest. In neuroscience, in order to understand from neural activity data how
neurons cooperate with each other, one can cluster the neurons by calculating
similarity (dissimilarity) measures among the spike train data (sequences) of
neurons in vivo.
In this subsection, we introduce two commonly used association measures:
Pearson's correlation coefficient and mutual information.