Statistical Clustering Analysis: An Introduction - Clustering Challenges in Biological Network

that assigning a point to a cluster according to the minimal Mahalanobis distance with the cluster center is equivalent to assigning it to a cluster according to the maximum likelihood value, as long as the distributions of clusters have similar general variances.
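The equivalence stated above can be checked numerically with a minimal sketch. It rests on assumptions that are not in the text: each cluster is modelled as a Gaussian with a diagonal covariance matrix, "general variance" is read as the determinant of that covariance matrix, and the cluster parameters, the points, and all function names are hypothetical.

```python
import math

def mahalanobis_sq(x, mean, variances):
    """Squared Mahalanobis distance for a diagonal covariance matrix."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variances))

def gaussian_log_likelihood(x, mean, variances):
    """Log-density of a Gaussian with diagonal covariance (assumed model)."""
    p = len(x)
    log_det = sum(math.log(v) for v in variances)  # log-determinant of the covariance
    return -0.5 * (mahalanobis_sq(x, mean, variances) + log_det + p * math.log(2 * math.pi))

# Hypothetical clusters: different shapes, but both covariance determinants equal 4.
clusters = [
    {"mean": (0.0, 0.0), "variances": (4.0, 1.0)},
    {"mean": (3.0, 3.0), "variances": (1.0, 4.0)},
]

def assign_by_distance(x):
    """Index of the cluster with the smallest Mahalanobis distance to x."""
    return min(range(len(clusters)),
               key=lambda k: mahalanobis_sq(x, clusters[k]["mean"], clusters[k]["variances"]))

def assign_by_likelihood(x):
    """Index of the cluster with the largest Gaussian log-likelihood at x."""
    return max(range(len(clusters)),
               key=lambda k: gaussian_log_likelihood(x, clusters[k]["mean"], clusters[k]["variances"]))

points = [(1.0, 0.5), (2.5, 3.5), (2.0, 0.5), (0.5, 3.0)]
for pt in points:
    print(pt, assign_by_distance(pt), assign_by_likelihood(pt))
```

Under the Gaussian assumption, the log-likelihood differs from the negative half squared Mahalanobis distance only by a term that depends on the determinant, so when the determinants match, both rules rank the clusters identically.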

5.2.1.3. Cosine

Cosine is widely used as a similarity measure in text clustering [27]. It is defined as

$$ s_C(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|} \qquad (5.6) $$

where $x_i \cdot x_j = x_i^{T} x_j$ is the inner product of the two vectors. In text clustering, texts are usually coded according to the presence (code 1) or absence (code 0) of the words or sentences of interest. For instance, suppose we are interested in five words (features) A, B, C, D and E. Two texts are coded as $x_1 = [1, 0, 0, 0, 0]$ and $x_2 = [0, 0, 0, 0, 1]$, which means that only word A is present in $x_1$ and only word E is present in $x_2$. If we use Euclidean distance to measure their dissimilarity, $d_2(x_1, x_2) = \sqrt{2}$. Now consider another two texts, $x_3 = [1, 1, 1, 1, 0]$ and $x_4 = [0, 1, 1, 1, 1]$. Their Euclidean distance is also $d_2(x_3, x_4) = \sqrt{2}$. Clearly, texts $x_1$ and $x_2$ have no word in common, whereas $x_3$ and $x_4$ have 3 out of 5 words in common. Texts $x_3$ and $x_4$ should therefore have lower dissimilarity than $x_1$ and $x_2$. However, Euclidean distance rates the two pairs as equally dissimilar.

Cosine solves this problem. The cosine of texts $x_1$ and $x_2$ is $s_C(x_1, x_2) = 0$, and that of texts $x_3$ and $x_4$ is $s_C(x_3, x_4) = 3/4$. This means that texts $x_3$ and $x_4$ have higher similarity than $x_1$ and $x_2$.
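The four example texts can be checked with a short sketch. The function names `euclidean` and `cosine` are illustrative choices, not part of the original text, and the zero-vector check is an added safeguard.

```python
import math

def euclidean(a, b):
    """Euclidean distance d_2 between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine(a, b):
    """Cosine similarity: inner product divided by the product of the norms."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)

# Presence/absence coding of words A, B, C, D, E, as in the text.
x1 = [1, 0, 0, 0, 0]
x2 = [0, 0, 0, 0, 1]
x3 = [1, 1, 1, 1, 0]
x4 = [0, 1, 1, 1, 1]

print(euclidean(x1, x2), euclidean(x3, x4))  # both equal sqrt(2)
print(cosine(x1, x2), cosine(x3, x4))        # 0.0 and 0.75
```

The two Euclidean distances coincide because each pair differs in exactly two positions, while the cosine separates the pairs because it depends on the shared words through the inner product.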

5.2.2. Measures for Variable Clustering

Variable clustering is important for identifying dependencies among variables, for causal analysis, and for selecting variables to reduce the dimension of the data. For instance, in the stock market, it is important to understand which stocks are interdependent, the cause-and-effect relationships among these interdependent stocks, and which stocks affect the stocks of interest. In neuroscience, to understand from neural activity data how neurons cooperate with each other, one can cluster the neurons by calculating similarity (or dissimilarity) measures among the spike train data (sequences) of neurons in vivo.

In this subsection, we introduce two commonly used association measures: Pearson's correlation coefficient and mutual information.
