Graph Algorithms for Integrated Biological Analysis, with Applications to Type 1 Diabetes Data - Clustering Challenges in Biological Network

protein interaction were used to devise distance measures and permutation tests for strength of commonality in graphs from these different data sources. Although no quantitative protein values were employed, data derived from Saccharomyces cerevisiae, commonly known as baker's or budding yeast, suggested that similarity in expression is related to similarity in function.

Our main goal is to identify biological pathways, each of which is anchored by a protein of interest. We are fortunate that both gene expression array data and protein gel data were collected from the exact same samples. If it were not for the expense involved, we would wonder why this is not done more often. Nevertheless, data integration remains a formidable task. The biggest difficulty we must overcome is probably that transcriptomic and proteomic data are generated by two completely different and unrelated processes. Thus we will not be able to use parametric statistical procedures, including the highly favored Pearson's correlation technique. Another problem is that current technologies for protein sensing are generally inferior to those for transcript detection. Modern expression array platforms can often detect transcripts for more than 50% of the known genes in the relevant organism, and generate highly reproducible quantitative measurements. In contrast, protein identification platforms can seldom cover more than 10% of an organism's estimated number of proteins, and then with only moderate quantitation and reproducibility. Of course function is a direct consequence of proteins, not mRNA, and so the importance of protein expression should not be underestimated. Finally, it is well known that gene expression at the mRNA level will not always correlate well with gene expression at the protein level. After all, gene products are subject to post-transcriptional and post-translational modifications, degradation and other factors. Put together, these difficulties make any serious attempt at transcript-protein co-expression analysis a huge challenge. In the sequel, we shall address this challenge with non-parametric methods, graph algorithms and a clique-centric combinatorial approach.
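
To make the contrast between parametric and non-parametric correlation concrete, the following is a minimal sketch and not the authors' implementation. It computes Pearson's coefficient directly and a rank-based (Spearman) coefficient as Pearson's coefficient applied to ranks. The function names and the six sample values are hypothetical, chosen only to show a relationship that is monotone but far from linear.

```python
import math

def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient computed on ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical measurements for one transcript and one protein in six samples.
transcript = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
protein = [math.exp(t) for t in transcript]  # monotone, strongly non-linear

print("Pearson: ", pearson(transcript, protein))
print("Spearman:", spearman(transcript, protein))
```

Because the hypothetical protein values increase strictly with the transcript values, the two rank lists are identical and the rank coefficient is 1, while Pearson's coefficient is noticeably lower. Working on ranks discards the magnitudes of the raw values, which is the price paid for not assuming a linear relationship between the two data types.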

We begin with the establishment of two correlation structures. For transcript-transcript relationships, we retain the Pearson's coefficients already computed. Transcript-protein relationships are typically much weaker and, for reasons already stated, require a non-parametric approach. For these we employ the rank metric provided by Spearman's correlation technique. This naturally leads to the loss of some information; a simple ranked list “flattens” raw data values. Our aim is now two-fold. We still wish to find dense, well-connected subgraphs. Yet these subgraphs must also be anchored as much as possible about some given protein, p, under scrutiny. Of course we could simply choose a putative pathway to be p and those transcripts ranked most highly with it. As we shall show, however, we can do better with the use of graph structure. To accomplish this, we take the
