Graph Algorithms for Integrated Biological Analysis, with Applications to Type 1 Diabetes Data - Clustering Challenges in Biological Network


ples with percent coefficients of variation usually in the low single digits [24, 29]. In contrast, protein expression data involves technologies that are more complicated and more difficult to standardize. The technical reproducibility of protein expression data collected from identical samples often has percent coefficients of variation in the low double-digit range [22, 31].
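For readers who want the reproducibility measure made concrete, the following is a minimal sketch of a percent coefficient of variation, taken here as 100 times the sample standard deviation divided by the mean. The function name `percent_cv` and both replicate lists are hypothetical; the numbers are invented to show the two regimes mentioned above and are not drawn from the study.

```python
from statistics import mean, stdev


def percent_cv(replicates):
    """Percent coefficient of variation: 100 * sample std dev / |mean|."""
    m = mean(replicates)
    if m == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return 100.0 * stdev(replicates) / abs(m)


# Hypothetical technical replicates of one measurement (made-up values).
mrna_like = [100.0, 102.0, 98.0, 101.0, 99.0]      # tight spread
protein_like = [100.0, 120.0, 85.0, 110.0, 90.0]   # wider spread

if __name__ == "__main__":
    print(f"mRNA-like CV:    {percent_cv(mrna_like):.2f}%")
    print(f"protein-like CV: {percent_cv(protein_like):.2f}%")
```

With these invented values the first list falls in the low single digits and the second in the low double digits.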

10.3. Correlation Computations

We employ the aforementioned 30 samples to compute a correlation matrix. The matrix entry at location (i, j) denotes the correlation coefficient between the ith and jth items (genes or proteins), which lies in the range [-1.0, 1.0]. Because mRNA arrays alone can measure over 45,000 different values, we may be faced with making sense of over a billion correlate pairs. Close examination of the data reveals a paucity of outliers, so that we are able to use the well-known Pearson's method for the computation of correlation coefficients. Because we are searching for putative pathways and networks, both positive and negative correlations are of equal interest. We therefore take absolute correlation values. Recall that these are biological and hence noisy data. Not every probe set is reliably measured in every sample. Thus we move away from simple correlation and compute a p-value for each pair of correlates, which is the probability of observing a correlation at least this strong if the true correlation were zero [33]. See Fig. 10.2.
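The computation described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the p-values of [33] are obtained, so we substitute the Fisher z-transformation with a normal approximation, which is a standard approximate test of zero correlation. The names `pearson`, `approx_p_value` and `correlation_tables`, and the three toy profiles, are hypothetical.

```python
import math
from itertools import combinations
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for the null hypothesis of zero correlation.

    Assumption: Fisher z-transformation with a normal approximation,
    not necessarily the method used in the source.
    """
    if n < 4:
        raise ValueError("the Fisher approximation needs at least 4 samples")
    r = max(-1.0, min(1.0, r))          # guard against rounding past +/-1
    if abs(r) == 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def correlation_tables(profiles):
    """Return (absolute correlation, p-value) tables keyed by item pair."""
    abs_r, p_val = {}, {}
    for u, v in combinations(sorted(profiles), 2):
        r = pearson(profiles[u], profiles[v])
        abs_r[(u, v)] = abs(r)          # sign is discarded, as in the text
        p_val[(u, v)] = approx_p_value(r, len(profiles[u]))
    return abs_r, p_val


# Hypothetical toy profiles over 8 samples (the study itself uses 30).
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]            # perfectly anti-correlated with a
c = [2, 1, 2, 1, 2, 1, 2, 1]            # weakly related to a
abs_r, p_val = correlation_tables({"a": a, "b": b, "c": c})
```

Because absolute values are taken, the anti-correlated pair (a, b) is treated exactly like a perfectly correlated one. The sketch visits every pair explicitly, which is quadratic in the number of items and would need a vectorized or parallel implementation at the scale of a full array.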

From this we can build a simple, unweighted graph as needed with the use of a cut-off value (we favor the use of p = 0.01) and a high-pass filter. An edge whose weight is less than the cut-off is discarded. Other edges are retained, but their weights are now ignored.
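The filtering step can be sketched as below. The text does not spell out how a p-value becomes the edge weight seen by the high-pass filter, so we assume that an edge is retained when its p-value is at most the cut-off (equivalently, when a weight of 1 - p is at least 0.99). The function `threshold_graph`, the item names, and the p-values are all hypothetical.

```python
def threshold_graph(p_values, vertices, cutoff=0.01):
    """Build an unweighted graph as an adjacency dict of sets.

    Assumption: a pair is kept when its p-value is <= cutoff.
    Retained edges carry no weight.
    """
    adjacency = {v: set() for v in vertices}
    for (u, v), p in p_values.items():
        if u != v and p <= cutoff:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


# Hypothetical p-values for pairs of transcripts (g*) and a protein (p1).
p_values = {
    ("g1", "g2"): 0.0004,
    ("g1", "p1"): 0.2,
    ("g2", "p1"): 0.01,
    ("g2", "g3"): 0.03,
}
graph = threshold_graph(p_values, ["g1", "g2", "g3", "p1"])
```

In this toy input only the pairs (g1, g2) and (g2, p1) survive, and g3 remains in the graph as an isolated vertex.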

10.4. Clique and Its Variants

We assume the reader is familiar with standard concepts in graph and complexity theory [25, 30]. We begin with the well-known clique problem. A clique is a densest possible subgraph: each pair of its vertices is connected by an edge. A clique is maximum if it is a largest clique in a graph. A clique is maximal if it is not contained wholly within a larger clique. A clique on five vertices is illustrated in Fig. 10.3. Protein correlations are too weak to find relevant relationships at this level, and so for them we turn to other methods, described in Section 10.6. The correlation matrix is transformed into a complete, weighted correlation graph by using a vertex for each transcript and protein, and by weighting the edge between each pair of items with the corresponding correlation matrix entry.
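To make the distinction between maximal and maximum cliques concrete, here is a small sketch on an unweighted graph. It enumerates maximal cliques with the classical Bron-Kerbosch procedure with pivoting; we use it purely for illustration and do not claim it is the algorithm employed in this chapter. The function names and the seven-vertex example graph are hypothetical.

```python
from itertools import combinations


def is_clique(adjacency, vertices):
    """True if every pair of the given vertices is joined by an edge."""
    return all(v in adjacency[u] for u, v in combinations(vertices, 2))


def maximal_cliques(adjacency):
    """List all maximal cliques (Bron-Kerbosch with pivoting).

    Assumes an undirected graph without self-loops. The running time is
    exponential in the worst case.
    """
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))   # r cannot be extended further
            return
        pivot = max(p | x, key=lambda v: len(adjacency[v] & p))
        for v in list(p - adjacency[pivot]):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacency), set())
    return cliques


# Hypothetical example: a clique on vertices 1..5, an extra edge 5-6,
# and a triangle 1-2-7.
edges = list(combinations(range(1, 6), 2)) + [(5, 6), (1, 7), (2, 7)]
adjacency = {v: set() for v in range(1, 8)}
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

found = maximal_cliques(adjacency)
largest = max(found, key=len)              # a maximum clique
```

Here the graph has three maximal cliques, {1, 2, 3, 4, 5}, {5, 6} and {1, 2, 7}, of which only the first is maximum. The subset {1, 2, 3} is a clique but not a maximal one, since it lies wholly within the five-vertex clique.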

Clique is widely acknowledged for its many applications in computational molec-
