A Novel Similarity-based Modularity Function for Graph Partitioning - Clustering Challenges in Biological Network

Biology Reference

In-Depth Information

2. Considering merging every possible pair of cluster c g and c h ,( g

= h ) and note

the change to Q S , which is given by:

∆ Q S g + h = Q S g + h −

Q S g −

Q S h

(11.10)

3. Choose the merge producing the largest ∆ Q S g + h , then update and record Q S .

4. Update all affected values of ∆ Q S g + i

and ∆ Q S h + j , ( i

= g ; j

= h ) using

formula (11.10).

5. Repeat step 3 and 4 until all vertices are grouped into one cluster.

6. Retrieve the partitioning with the highest Q S value as the best result.

In step 4, only clusters with edges connecting them to the merged pair need to

be updated. The number of clusters to be updated that are connected to cluster g is

|

updates,

each update taking O(log k ) < O(log n ) time to access the data structure, where

n is the number of vertices in the Graph. Thus each iteration takes O((

i

|

and the number connected to cluster h is

|

j

|

. There are at most

|

i

|

+

|

j

|

i

|

j

|

+

)log

n ) time [1].

The overall running time is the sum of all joining steps, which, as Newman

indicated is at most O( md log n ) [13], where m is number of edges and d is the

depth of the dendrogram. More details of this algorithm may be found in Newman

et al. [1, 13].

11.6. Evaluation Results

In this section, we present experimental result from analyzing synthetic graphs,

detecting community structure in social networks, and clustering entities from a

customer database. The size of the graphs varies from tiny to very large. Each

example has a known structure that we are seeking to reproduce. We introduce a

measure of “error” to gauge the performance of different clustering algorithms: if

a node is classified into a cluster other than the pre-defined cluster, we count it as

an error.

11.6.1. Tests on Synthetic Graphs

Synthetic graphs are generated based on predefined cluster structure and proper-

ties, so as to facilitate systematic study of the performance of our similarity-based

modularity definition and graph partitioning algorithms. Here we use the same

construction as used in several papers [6, 7, 13]. The synthetic graph consists of

128 vertices that are divided into 4 equal-size clusters. Each vertex connects to

vertices within its cluster with a probability P i , and connects to vertices outside

its cluster with a probability P o <P i . On average, each vertex is connected to

Clustering Challenges in Biological Network

Search WWH ::

Custom Search

Home