Functional classification of proteins based on protein interaction data (Proteomics)

Deciphering gene/protein function on a large scale, for a better understanding of cell functioning and organism development, is one of the biggest challenges in biology. The approaches designed for this purpose have followed both methodological progress and conceptual advances. Throughout, computational methods have played a constant role, evolving with the way biologists apprehend gene/protein function.

Since the seventies, biologists have been comparing protein sequences using alignment methods (Needleman and Wunsch, 1970; Smith and Waterman, 1981). They progressively introduced useful measures such as identity and similarity percentages, z-scores, Blast scores, and so on (see Article 93, Detecting protein homology using NCBI tools, Volume 8). Later, methods enabling the comparison of secondary and tertiary protein structures were also devised (see Article 75, Protein structure comparison, Volume 7). The main purpose of such comparison methods is to gain new insight into protein function by inferring functional relationships between proteins, thereby making it possible to transfer function from a protein of known function to a protein of unknown function. The hypothesis underlying such functional inference is that protein sequences and structures are evolutionarily conserved in order to perform a conserved function. But the relationship linking sequence, structure, and function is not always straightforward, and although these approaches usually lead to useful and testable hypotheses, they remain a risky exercise that in some cases supports wrong conclusions (Devos and Valencia, 2000). For instance, a slight change in protein sequence may be overlooked in the analysis even though it leads to important functional changes. Conversely, two protein domains with no primary sequence similarity may wrongly be proposed to be functionally unrelated when they in fact share the same 3D structure and therefore the same function (Grossman and Laimins, 1996).


In addition, as we previously discussed (Jacq, 2001), the function of a gene/protein is a complex notion that can be defined at several integrated levels of complexity (molecule, cell, tissue, organism, etc.). Generally, sequence and structure analyses reveal only the possible molecular function(s) of proteins, when domains of known function are identified in their sequences. Consequently, the functional knowledge provided by the approaches described above concerns only the biochemical role of a protein, without informing us about the particular cellular, physiological, or developmental process(es) in which that role is exerted.

Both to obtain a more contextual view of gene/protein function and to make functional predictions, computational methods relying upon genome organization have been developed. The domain fusion or Rosetta Stone method proposes that two proteins from a given organism are functionally related when they exist as a single fused polypeptide in another proteome (Enright et al., 1999; Marcotte et al., 1999a). Other methods rest on the observations that genes repeatedly found as neighbours on chromosomes in different organisms may encode functionally related proteins (Pellegrini et al., 1999) and that the phylogenetic coinheritance of proteins in several proteomes suggests a functional link between them (Dandekar et al., 1998; Overbeek et al., 1999; Tamames et al., 1997). Although these methods and their combinations (Marcotte et al., 1999b) have been used to predict the function of a number of proteins, they still suffer from limitations: they work better when applied to completely sequenced genomes, and they are better suited to prokaryotic than to eukaryotic genome organization. In addition, they are valid only for a small number of proteins, and they permit only a “functional linkage” to be proposed between proteins, sometimes without specifying the cellular process(es) in which the linked proteins are involved.

It therefore appears that new computational methods able to decode the cellular, physiological, and developmental function of genes/proteins on a large scale would not only widen the field of investigation but, more importantly, would bring a necessary, novel, comprehensive, and integrated understanding of gene function. Because protein action is seldom isolated but rather exerted in concert with other proteins, molecular interactions between proteins are essential to all biological processes in all organisms. The list of partners with which a given protein interacts recapitulates the essential aspects of its cellular function and provides a kind of condensed “functional identity card” for that protein. Interactions thus represent the raw material on which new methods for describing protein function can be grounded.

Recent years have seen the introduction of many different high-throughput methods, such as DNA microarrays (see Article 90, Microarrays: an overview, Volume 4) and large-scale two-hybrid screens. Protein-protein interaction maps are now available for three eukaryotic model organisms: the budding yeast (see Article 39, The yeast interactome, Volume 5), the worm (see Article 38, The C. elegans interactome project, Volume 5), and the fly (Formstecher et al., 2005; Giot et al., 2003). These maps form large, intricate networks that offer a renewed vision of the cell functioning as an integrated system. However, they need to be analyzed in detail in order to extract the functional information they contain. Various methods of biological network analysis have been proposed so far. They may, for instance, identify functional modules after network clustering (Rives and Galitski, 2003), or assign a function to proteins of unknown function on the basis of the functional annotations of their neighbors (Vazquez et al., 2003).

Another way of analyzing the interaction network is to functionally compare proteins at the cellular level. As stated above, this approach represents a useful complement to sequence comparison methods, which address function at the molecular level. We have thus proposed a bioinformatic method named PRODISTIN (Protein distance based on interactions) (Brun et al., 2003), which allows a functional classification of proteins according to the identity of their interacting partners. The central idea in this interaction-based functional clustering is not to compare the proteins themselves but to compare the lists of their interaction partners, assuming that the more interacting partners two proteins share, the more they should be functionally related. Let us consider three proteins A, B, and C, each of them establishing 30 specific, experimentally determined interactions with other protein partners. If A and C, B and C, and A and B have respectively 25, 13, and 2 common interactors, it seems intuitively reasonable to conclude that A and C are highly functionally related, that B and C share at least some functional features, and that A and B are probably not (or only marginally) functionally related. In order to translate this rather simple hypothesis into a mathematical formalism, we decided to calculate the Czekanowski-Dice distance between the proteins forming the network. This distance, which is intended to provide a direct measurement of the functional relationships between proteins (belonging to the same multiprotein complex, the same pathway or, more broadly, the same cellular process(es)), corresponds to:

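$$D(i,j) = \frac{\left|\mathrm{Int}(i)\,\Delta\,\mathrm{Int}(j)\right|}{\left|\mathrm{Int}(i)\cup\mathrm{Int}(j)\right| + \left|\mathrm{Int}(i)\cap\mathrm{Int}(j)\right|}$$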

in which i and j denote two proteins, Int(i) and Int(j) are the lists of their interactors, and Δ is the symmetric difference between the two sets. A key advantage of such a distance is that it gives more weight to the shared interactors, that is, to the similarities, than to the differences, and that it allows functional similarities to be displayed as a tree.
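As a minimal sketch of this distance, the Python functions below apply the definition to two sets of interactors. The function names are illustrative and are not part of PRODISTIN itself. Because the size of the union plus the size of the intersection equals the sum of the two list sizes, the same value can also be obtained from counts alone. That shortcut is applied here to the figures quoted for proteins A, B, and C (30 interactors each; 25, 13, and 2 shared), taking each pair on its own.

```python
def czekanowski_dice(int_i, int_j):
    """Czekanowski-Dice distance between two sets of interactors."""
    int_i, int_j = set(int_i), set(int_j)
    denominator = len(int_i | int_j) + len(int_i & int_j)
    if denominator == 0:
        raise ValueError("both interactor lists are empty")
    # Symmetric difference (interactors held by only one protein) over
    # union plus intersection (shared interactors counted twice).
    return len(int_i ^ int_j) / denominator


def distance_from_counts(n_i, n_j, n_shared):
    """Same distance computed from list sizes and the number of shared interactors.

    Uses |union| + |intersection| = n_i + n_j and
    |symmetric difference| = n_i + n_j - 2 * n_shared.
    """
    return (n_i + n_j - 2 * n_shared) / (n_i + n_j)


# Counts quoted in the text; each pair is treated separately.
for pair, shared in (("A-C", 25), ("B-C", 13), ("A-B", 2)):
    print(pair, round(distance_from_counts(30, 30, shared), 3))
```

The distance runs from 0 (identical interactor lists) to 1 (no shared interactor), so the three pairs come out at roughly 0.17, 0.57, and 0.93, in line with the intuitive ranking given above.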

In practice, starting from a list of binary protein-protein interactions, the PRODISTIN method consists of three successive bioinformatic steps: first, the functional distance is calculated between all possible pairs of proteins in the network; second, all distance values are clustered using a neighbor joining algorithm, leading to a classification tree; third, the tree is visualized and subdivided into formal classes. The PRODISTIN classes, which allow a powerful functional interpretation of the tree, are delimited according to tree topology and protein functional annotations (such as Gene Ontology (GO) terms (Ashburner et al., 2000)). We define them as subtrees containing at least three proteins that share the same functional annotation and account for at least 50% of the class members.
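The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the neighbor joining shown is a textbook variant that keeps only the tree topology (the exact variant used by PRODISTIN may differ), and `is_class` encodes one reading of the class criterion, in which annotations are given as a set of terms per protein.

```python
from collections import defaultdict
from itertools import combinations


def interactor_sets(interactions):
    """Map each protein to the set of its partners in a list of binary interactions."""
    partners = defaultdict(set)
    for a, b in interactions:
        partners[a].add(b)
        partners[b].add(a)
    return dict(partners)


def distance_matrix(partners):
    """Step 1: Czekanowski-Dice distance for every pair of proteins."""
    dist = {p: {} for p in partners}
    for i, j in combinations(sorted(partners), 2):
        a, b = partners[i], partners[j]
        dist[i][j] = dist[j][i] = len(a ^ b) / (len(a | b) + len(a & b))
    return dist


def neighbor_joining(dist):
    """Step 2: neighbor joining, topology only (branch lengths are not kept).

    Leaves are protein names; internal nodes are 2-tuples of subtrees.
    """
    d = {a: dict(row) for a, row in dist.items()}  # working copy
    nodes = sorted(d)
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair minimizing the usual neighbor-joining criterion.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )
        new = (a, b)
        nodes = [k for k in nodes if k not in (a, b)]
        d[new] = {}
        for k in nodes:
            d[new][k] = d[k][new] = 0.5 * (d[a][k] + d[b][k] - d[a][b])
        nodes.append(new)
    return tuple(nodes)


def leaves(tree):
    """Protein names found under a node of the tree."""
    if isinstance(tree, tuple):
        return [name for child in tree for name in leaves(child)]
    return [tree]


def is_class(members, annotations, min_shared=3, min_fraction=0.5):
    """Step 3 criterion as read here: some annotation is carried by at least
    three members and by at least half of all members of the subtree."""
    counts = defaultdict(int)
    for protein in members:
        for term in annotations.get(protein, ()):
            counts[term] += 1
    return any(
        c >= min_shared and c / len(members) >= min_fraction
        for c in counts.values()
    )


# Toy network with made-up protein names: A and B share both partners, as do C and D.
interactions = [("A", "x"), ("A", "y"), ("B", "x"), ("B", "y"),
                ("C", "z"), ("C", "w"), ("D", "z"), ("D", "w")]
partners = interactor_sets(interactions)
dist = distance_matrix(partners)
tree = neighbor_joining(dist)
print(tree)
```

In this toy network, A and B have identical interactor lists (distance 0) and end up as direct neighbors in the tree, whereas A and C share no interactor (distance 1). Visualization of the tree, the other half of the third step, is left out of the sketch.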

Figure 1 shows a classification tree containing 602 yeast proteins, obtained by applying PRODISTIN to 2946 protein-protein interactions involving 2139 proteins, that is, 38% of the Saccharomyces cerevisiae proteome. A detailed study of this PRODISTIN tree permitted an integrated analysis of yeast cellular processes and their crosstalk (Brun et al., 2003). Indeed, the PRODISTIN method efficiently clusters proteins involved in the same cellular process(es). When a protein belongs to a PRODISTIN class devoted to a particular cellular process, the classification makes it possible to propose the involvement of that protein in the process, regardless of the current knowledge about its function. In this way, we proposed a cellular function for 45% of the otherwise uncharacterized proteins present in the tree, as well as the involvement of proteins of known function in other functions (Brun et al., 2003).
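The way a proposal follows from class membership can be sketched as below. This is only an illustration: the function name, the protein names, and the process labels are hypothetical, and each class is assumed to be given as a cellular process together with its member proteins.

```python
def propose_processes(classes, annotations):
    """Propose the process of a class for members not yet annotated with it.

    classes: list of (cellular_process, member_proteins) pairs.
    annotations: dict mapping a protein to the set of processes already known;
    uncharacterized proteins are simply absent from the dict.
    """
    proposals = {}
    for process, members in classes:
        for protein in members:
            if process not in annotations.get(protein, set()):
                proposals.setdefault(protein, set()).add(process)
    return proposals


# Hypothetical class of five proteins devoted to "process X".
classes = [("process X", ["P1", "P2", "P3", "P4", "P5"])]
annotations = {
    "P1": {"process X"},
    "P2": {"process X"},
    "P3": {"process X"},
    "P4": {"process Y"},  # known function, but in another process
}
proposals = propose_processes(classes, annotations)
print(proposals)
```

Here the uncharacterized protein P5 and the protein P4, already known in another process, both receive a proposal for "process X", mirroring the two kinds of prediction described above.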

Figure 1 A functional classification tree for 602 yeast proteins computed with the PRODISTIN method. PRODISTIN classes on the circular classification tree have been colored according to their corresponding “cellular role”. Protein names have been omitted for clarity.

Just as predicting function at the cellular level differs from predicting function at the molecular level, classifying proteins functionally according to their interaction subnetwork differs from classifying proteins in a structure- or sequence-based manner (see Article 82, Structure comparison and protein structure classifications, Volume 6, Article 91, Classification of proteins by sequence signatures, Volume 6, and Article 92, Classification of proteins by clustering techniques, Volume 6). For instance, what should we expect from both types of classification when investigating the evolutionary fate of duplicated genes in yeast? Given that duplicated genes are mainly annotated as having identical molecular function according to GO (Baudot et al., 2004), no major differences should be expected from structure- and sequence-based classifications. Conversely, the PRODISTIN functional classification proved to be a valuable tool for studying the evolution of the function of the yeast duplicated genes, since a new type of information emerged from its use in an evolutionary perspective (Baudot et al., 2004). Indeed, comparing the surrounding subnetworks of paralogous proteins allows us to differentiate several types of paralogue pairs according to their classification features. Three different behaviors of paralogue pairs with respect to the PRODISTIN classification were identified, leading to a scale of functional divergence for duplicated genes that is based on protein-protein network analysis and is independent of sequence similarity. From the least to the most divergent in cellular function, paralogues belong either to the same functional class, to different classes devoted to the same cellular function, or to different classes devoted to different functions. Comparing these results with the functional information carried by GO annotations and sequence comparison, it appeared that interaction network analysis reveals functional subtleties that are not discernible by other means.
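The three-level scale of functional divergence can be written down as a small decision rule. The sketch below assumes that each classified protein is mapped to a class identifier and each class to the cellular function it is devoted to. All names and data are hypothetical.

```python
def divergence_level(protein_a, protein_b, class_of, function_of_class):
    """Place a pair of paralogues on the three-level divergence scale.

    Returns 1 if both proteins fall in the same class, 2 if they fall in
    different classes devoted to the same cellular function, 3 if the classes
    are devoted to different functions, and None if either protein is not
    assigned to a class.
    """
    class_a = class_of.get(protein_a)
    class_b = class_of.get(protein_b)
    if class_a is None or class_b is None:
        return None
    if class_a == class_b:
        return 1
    if function_of_class[class_a] == function_of_class[class_b]:
        return 2
    return 3


# Hypothetical class assignments and class functions.
class_of = {"Pa": "c1", "Pb": "c1", "Pc": "c2", "Pd": "c3"}
function_of_class = {"c1": "function F", "c2": "function F", "c3": "function G"}

print(divergence_level("Pa", "Pb", class_of, function_of_class))  # same class
print(divergence_level("Pa", "Pc", class_of, function_of_class))  # same function
print(divergence_level("Pa", "Pd", class_of, function_of_class))  # different functions
```

Note that the rule uses only class membership in the tree, which is why the resulting scale does not depend on sequence similarity.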

In conclusion, this first use of the PRODISTIN functional classification of yeast proteins to address a specific question about gene/protein function validates the approach and, more broadly, emphasizes the importance of interaction network data and their analysis in deciphering cell functioning. Considering the cellular function of genes/proteins in the context of a molecular network, as in the aforementioned study of the evolution of the function of duplicated genes, should prove important when approaching several other biological problems. To cite just a few, questions such as the functional aspects of horizontal transfer, the integration of different signaling pathways, or the functional relationship linking orthologous genes/proteins from model organisms for which interaction maps are available are likely to benefit largely from applying the PRODISTIN method. Furthermore, as protein profiling methods progressively allow a detailed description of the proteomes of different cell types of the same organism, comparisons of the different protein networks encoded by a single genome will soon become possible. As far as the human proteomes are concerned, it is likely that the new vision of diseases as perturbations of specific molecular networks, which can be studied by network analysis methods such as PRODISTIN, will offer new perspectives in understanding both their molecular and their phenotypic aspects.
