Analysis of Regulatory and Interaction Networks from Clusters of Co-expressed Genes - Clustering Challenges in Biological Network

of motif values. This allows for the identification of (i) overpopulated motifs, and

(ii) genes sharing similar motif values. Hence we have achieved a fine-grained

“clustering” of the data where the number of potential clusters is dependent upon

the definition of the hashing function.
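
As a minimal sketch of points (i) and (ii), assume that each gene has already been hashed to a motif value. The hashing function itself is not reproduced here, and the gene names and motif strings below are hypothetical. Grouping genes by motif value gives both the population of each motif and the genes that share it:

```python
from collections import defaultdict

def group_by_motif(motif_of_gene):
    """Map each motif value to the list of genes hashed to it."""
    groups = defaultdict(list)
    for gene, motif in motif_of_gene.items():
        groups[motif].append(gene)
    return dict(groups)

def rank_by_population(groups):
    """Return (motif, population) pairs, most populated first."""
    return sorted(((m, len(g)) for m, g in groups.items()),
                  key=lambda pair: (-pair[1], pair[0]))

# Hypothetical motif assignments; real values would come from the hashing step.
motif_of_gene = {"g1": "abba", "g2": "abba", "g3": "baab",
                 "g4": "abba", "g5": "baab", "g6": "aabb"}
groups = group_by_motif(motif_of_gene)
ranked = rank_by_population(groups)
```

Deciding which motifs count as overpopulated would additionally require a threshold or a null model, which this sketch does not attempt.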

(c) Quantification of transcriptional state. We define the transcriptional state of

the system as the cumulative distribution function (CDF) of expression values of a select subset of motifs (based on

the corresponding genes), and we track this quantity as it evolves over time

relative to the control state (the distribution at t = 0 hr). We characterize each motif for

its ability to represent the overall transcription dynamics of the system. In order

to do so, we define a new term, transcriptional state, that quantifies the deviation of

the aggregate distribution of expression values from a control state. An optimization

framework is defined that characterizes expression motifs by how well they

replicate the response of the entire system. Thus, we are able to rank the expression motifs by

their contribution to the overall state change of the system. The minimum number

of expression motifs required to accurately represent the dynamic response of the

system defines the set of informative genes, i.e., genes maximally affected by the

specific experimental perturbation. To quantify the hypothesis that informative

subsets of genes should give rise to a distribution of expression values maximally

affected by the experiment, we employ the Kolmogorov-Smirnov (KS) test [49], which evaluates

whether two arbitrary distributions differ. Informative

subsets are the ones with the ability to capture significant deviations from the base

distribution. The KS statistic is defined as:

$$D = \max_{1 \le i \le n} \left| F(Y_{g_i}) - F(Y_{g_i}(0)) \right|,$$

where $F(Y_{g_i}(0))$ is the cumulative distribution of the expression values at time t = 0.
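
To make the statistic concrete, the following is a minimal sketch that computes the largest gap between two empirical CDFs using only the Python standard library. The function names and the expression values are hypothetical, and the empirical CDF is an assumption about how F is estimated from the data:

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical CDF of `sample` as a function x -> F(x)."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

def ks_statistic(current, control):
    """D = max |F_current(x) - F_control(x)| over all observed values x."""
    f_cur, f_ctl = ecdf(current), ecdf(control)
    # Both CDFs are step functions, so the maximum gap occurs at a sample point.
    points = set(current) | set(control)
    return max(abs(f_cur(x) - f_ctl(x)) for x in points)

control = [0.1, 0.2, 0.3, 0.4]   # hypothetical expression values at t = 0
overlap = [0.3, 0.4, 0.5, 0.6]   # hypothetical values, partly shifted
shifted = [0.5, 0.6, 0.7, 0.8]   # hypothetical values, fully shifted

d_same = ks_statistic(control, control)
d_overlap = ks_statistic(overlap, control)
d_shift = ks_statistic(shifted, control)
```

Identical distributions give D = 0, and D grows toward 1 as the two distributions separate. Where SciPy is available, `scipy.stats.ks_2samp` reports the same two-sample statistic together with a p-value.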

This statistic provides a metric for the magnitude of the difference between

two distributions. Since the data are presented as a time series,

a value of the KS statistic is obtained at each time point. Therefore, the overall

metric becomes . With the definition of the transcriptional state and the ability to

quantify deviations from the control (sham) state, we are now in a position

to define a rigorous methodology for selecting maximally informative expression

motifs. Applying the KS test over time allows us to quantify how

much the CDF of a particular subset of genes deviates from the corresponding

CDF at time t = 0 (control/sham). We currently implement a greedy algorithm that

adds peaks in decreasing order of their population and selects the subset with the greatest

deviation. The greedy heuristic was selected to limit the combinatorial

complexity of the problem, and we consider it an adequate approximation because

the more over-represented a motif is, the more important it is.
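
The greedy selection can be sketched as follows. This is an illustration under stated assumptions, not the implementation described in [50]: the per-time-point KS statistics are aggregated by summation (the aggregate formula is not given above), candidate subsets are the prefixes of the population-ordered motif list, and all function names and data are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(current, control):
    """Two-sample KS statistic between two lists of expression values."""
    cur, ctl = sorted(current), sorted(control)
    cdf = lambda ordered, x: bisect_right(ordered, x) / len(ordered)
    return max(abs(cdf(cur, x) - cdf(ctl, x)) for x in set(cur) | set(ctl))

def deviation(profiles):
    """ASSUMPTION: sum the KS statistics over all time points after t = 0.

    `profiles` is a list of per-gene time series; index 0 is the control.
    """
    control = [p[0] for p in profiles]
    n_times = len(profiles[0])
    return sum(ks_statistic([p[t] for p in profiles], control)
               for t in range(1, n_times))

def greedy_select(motifs):
    """Add motifs in decreasing order of population; keep the best prefix.

    `motifs` maps a motif label to the list of gene profiles hashed to it.
    Returns (selected motif labels, deviation score of their pooled genes).
    """
    order = sorted(motifs, key=lambda m: (-len(motifs[m]), m))
    pooled, best_k, best_score = [], 0, float("-inf")
    for k, motif in enumerate(order, start=1):
        pooled.extend(motifs[motif])
        score = deviation(pooled)
        if score > best_score:
            best_k, best_score = k, score
    return order[:best_k], best_score

# Hypothetical data: three responsive genes and two unresponsive ones.
motifs = {
    "up":   [[0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [0.2, 1.2, 2.2]],
    "flat": [[0.0, 0.0, 0.0], [0.1, 0.1, 0.1]],
}
selected, score = greedy_select(motifs)
```

In this toy example, adding the unresponsive motif dilutes the deviation from the control distribution, so only the most populated motif is retained. The greedy pass evaluates one candidate subset per motif rather than every possible combination of motifs.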

A detailed discussion of the methodology is presented in [50]. To fully

explore the methods, we focus on two distinct experimental protocols to assess the
